cdisc-org / cdisc-rules-engine

Open source offering of the cdisc rules engine
MIT License

Getting the size of large XPT datasets #690

Open gerrycampion opened 2 months ago

gerrycampion commented 2 months ago

If a dataset is large but still small enough to fit into memory, the engine currently uses pandas and reads every record to determine the record count. This is a problem because the record count is treated as dataset metadata, which should be quick to fetch, yet it can take a long time to compute. We should find a better approach to get the record count. If none of the libraries provide a convenient way to calculate it, we can implement our own approach.

Refer to the specifications for XPT files: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/movefile/p0ld1i106e1xm7n16eefi7qgj8m9.htm https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/movefile/n0167z9rttw8dyn15z1qqe8eiwzf.htm

A possible solution:

1. Read the header information to find the size of the variables, the record size, and the start offset of the record data (`start`).
2. Read the last chunk of data to find out how much trailing ASCII blank padding there is.
3. Calculate the number of records as `(total_size - start - padding) / record_size`.
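The steps above could be sketched roughly as follows. This is a minimal sketch, not the engine's implementation: `xpt_record_count` is a hypothetical name, it assumes `record_length` (the sum of the variable lengths) has already been parsed from the NAMESTR headers elsewhere, and it assumes the observation header record sits on an 80-byte boundary, as the XPT spec lays out. Note the known ambiguity: if the last record itself ends in ASCII blanks, they are indistinguishable from padding.

```python
import os

def xpt_record_count(path, record_length):
    """Estimate the record count of an XPT file without reading the
    data region, using only the header and the final 80-byte block.

    record_length: sum of variable lengths (assumed parsed from the
    NAMESTR header records elsewhere).
    """
    total_size = os.path.getsize(path)
    with open(path, "rb") as f:
        # XPT header records are 80 bytes each and aligned to
        # 80-byte boundaries; scan for the OBS header record, after
        # which the observation data begins.
        pos = 0
        while True:
            chunk = f.read(80)
            if not chunk:
                raise ValueError("no observation header record found")
            pos += 80
            if chunk.startswith(b"HEADER RECORD*******OBS"):
                break
        start = pos
        # Read only the final 80-byte block and count the trailing
        # ASCII blanks that pad the data out to an 80-byte boundary.
        f.seek(-80, os.SEEK_END)
        tail = f.read(80)
        padding = len(tail) - len(tail.rstrip(b" "))
    return (total_size - start - padding) // record_length
```

Only two small reads are issued (the header scan and one 80-byte tail block), so the cost is independent of the number of observations.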

Some libraries to check for existing functionality (there may be others): `xport.v56`, `pyreadstat`, `pandas.read_sas`

gerrycampion commented 2 months ago

Currently, this is calculated incorrectly here: https://github.com/cdisc-org/cdisc-rules-engine/blob/a9edcfb26d3235fac825919c6df3f79ad2021f85/cdisc_rules_engine/services/datasetxpt_metadata_reader.py#L81

JozefAerts commented 2 months ago

Hi Gerry, Just for your info: the header of the XPT file does not contain the number of records. It does however contain information about the record length, which is fixed, i.e. the same for each record. What I usually do is to read the file in a streaming way (so no storage in memory) and count the number of records "passing by". You can chat with Lex Jansen about possible solutions, as he is the real specialist.
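The streaming approach described here can be sketched at the byte level: walk the data region one fixed-length record at a time, counting records as they "pass by", without ever holding the dataset in memory. This is a hypothetical sketch (`count_records_streaming` is not an engine function); it assumes `start` and `record_length` have already been obtained from the header, and it stops at the first all-blank read, which the trailing padding produces (an all-blank real record would also stop it, the same ambiguity noted in this thread).

```python
def count_records_streaming(path, start, record_length):
    """Count fixed-length records by streaming the data region one
    record at a time, never loading the whole file into memory."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while True:
            rec = f.read(record_length)
            # Stop at EOF, a short trailing read, or all-blank
            # padding bytes after the final record.
            if len(rec) < record_length or not rec.strip(b" "):
                break
            count += 1
    return count
```

Memory use is constant, but the runtime is still proportional to the file size, which is exactly the cost this ticket aims to avoid.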

gerrycampion commented 2 months ago

> Hi Gerry, Just for your info: the header of the XPT file does not contain the number of records. It does however contain information about the record length, which is fixed, i.e. the same for each record. What I usually do is to read the file in a streaming way (so no storage in memory) and count the number of records "passing by". You can chat with Lex Jansen about possible solutions, as he is the real specialist.

@JozefAerts, the purpose of this ticket is to avoid streaming the entire dataset. Instead, we would read the header to get the record length and the start index of the first record, and we would read the final 80-byte chunk to get the final index of the last record (before the ASCII blank padding). From this, we can divide the total size of the record region by the record length to get the record count. This is much faster than streaming the entire dataset.

JozefAerts commented 2 months ago

Smart! Is it possible in Python to read a file backwards from the end? I did not know that ... I now see that Java has something like that: ReversedLinesFileReader ...

gerrycampion commented 2 months ago

I think it is available in most languages as long as the filesystem supports it. It is called random access. In Python, it looks like you would use `seek`. Java also has RandomAccessFile built-in.
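For illustration, reading only the final 80-byte block of a file with `seek` looks like this (a minimal sketch; `last_block` is a hypothetical helper, and the clamp guards against files shorter than the requested size):

```python
import os

def last_block(path, size=80):
    """Read only the final `size` bytes of a file via random access,
    without touching the rest of the file."""
    with open(path, "rb") as f:
        # Seek relative to the end of the file (os.SEEK_END),
        # clamped so small files do not raise on a negative offset.
        f.seek(-min(size, os.path.getsize(path)), os.SEEK_END)
        return f.read()
```

The operating system positions the file offset directly; no preceding bytes are read.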

gerrycampion commented 1 month ago

@SFJohnson24 this says ready to review, but there is no PR attached

SFJohnson24 commented 1 month ago

@gerrycampion I am working on this; it was moved to review by mistake.