debrief / pepys-import

Support library for Pepys maritime data analysis environment
https://pepys-import.readthedocs.io/
Apache License 2.0

Design logic for parsing multiple files #4

Closed IanMayo closed 4 years ago

IanMayo commented 4 years ago

Library users will create importers, and register them with our library.

Then they will instruct our library to load a single file, or to recurse through a set of folders.

For each file, we will check with each importer whether it can handle that kind of file. To allow this, importers will either provide some metadata describing their capabilities, or implement a set of methods that let the library determine whether they can load that file.
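A minimal sketch of what that capability check could look like (the class and method names here are illustrative assumptions, not the library's actual API):

```python
from abc import ABC, abstractmethod


class Importer(ABC):
    """Illustrative base class: each registered importer describes what it can load."""

    @abstractmethod
    def can_load_this_type(self, suffix: str) -> bool:
        """Return True if this importer handles files with the given extension."""

    @abstractmethod
    def can_load_this_file(self, first_line: str) -> bool:
        """Return True if the start of the file looks like something this importer can parse."""

    @abstractmethod
    def load_this_file(self, path: str, data_store) -> None:
        """Parse the file and record its measurements in the data store."""
```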

Here is an initial set of tests:

This task is to produce a proposal (with pseudocode) for how to implement the above

The proposal will be discussed by the project team.

BarisSari commented 4 years ago

I checked the pseudocode in the Python Import Library document, and here are a few thoughts that came to mind:

IanMayo commented 4 years ago

So, they would import read_csv for CSV files and we wouldn't need to iterate over importers

They'll learn more about what they want once they get a chance to play with working software. But one scenario they expect is receiving a zip file with a range of file types inside it. They just want our lib to process all the files it can. That fits the model in the unit test: I register a number of importers, then point it at a folder of mixed data, and let it process it all.
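As a rough sketch of that scenario (the function name and signature are assumptions, building on the hypothetical Importer interface above, not the real entry points):

```python
import os


def process_folder(folder, importers, data_store):
    """Walk a folder of mixed data and offer each file to every registered importer."""
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            suffix = os.path.splitext(name)[1]
            with open(path, "r", errors="ignore") as f:
                first_line = f.readline()
            for importer in importers:
                # the first importer that accepts the file performs the load
                if importer.can_load_this_type(suffix) and importer.can_load_this_file(first_line):
                    importer.load_this_file(path, data_store)
                    break
```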

By the way, I think the most important thing about these importers would be preprocessing and getting the same format no matter their sources.

That kind of matches my expectations. We're going to have intermediate objects for measurements. This means:

Storing these files under DataStore's Datafile is still a question for me. As I said before, if we had proper preprocessing methods for all importers, we wouldn't need duplicated add methods.

I'm keen to understand this a bit more. Could you explain the preprocessing methods?

BarisSari commented 4 years ago

I'm keen to understand this a bit more. Could you explain the preprocessing methods?

I think I can give an example of a .rep file. In the database there is a location field, which is a string. But in .rep files, latitude and longitude are parsed and sent to _add_to_sensors_from_rep(..., lat, long, ...). If there were a preprocessing method that shaped the data to match the database table (let's assume it converts latitude and longitude to a single string called location), we could send it to _add_to_sensors(..., location, ...).

So, what I'm suggesting is having add_to_xxxx methods for each table which only accept fields that exist in the database, and making the intermediate objects conform to that shape before calling these methods.
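For illustration only, a sketch of that idea under assumed names (format_location, add_rep_state and add_to_sensors here are hypothetical, not the current DataStore API):

```python
def format_location(lat, lon):
    """Render parsed latitude/longitude as the single string stored in the location column."""
    return f"POINT({lon} {lat})"


def add_rep_state(data_store, sensor, timestamp, lat, lon):
    """Preprocess .rep-specific fields so only database-shaped values reach the generic add method."""
    data_store.add_to_sensors(sensor=sensor, timestamp=timestamp,
                              location=format_location(lat, lon))
```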

IanMayo commented 4 years ago

Aah, got you. It will actually be simpler than that :-)

DataFile.createState(sensor, timestamp) returns an intermediate State object: https://docs.google.com/document/d/1RW148XW4Iqr1mwEk5TezmgBf8z9zhZ-T88flk9stFiY/edit#heading=h.l84oqhpe01u4

A record in the States table only has two compulsory fields, and in the constructor we've provided them already.

The State object that gets returned can have overloaded setters:

But it will only store a point internally, and have one getter:

So, when we finally push data to the database we will just loop through those State objects and call _add_to_sensors(state: State).

So, yes, we're in agreement. We'll collate/tidy the data before we start interacting with the database. By storing measurements in these intermediate classes, we can forget about the actual parser used. We have an array of internally consistent State objects, and we can run some stats over them to look for erroneous data, statistical outliers, etc.
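A sketch of that intermediate object, using the names from this thread (the details are illustrative assumptions, not the final implementation): the constructor takes only the two compulsory fields, the setters accept more than one input shape, and a single getter exposes the internally stored point.

```python
class State:
    """Intermediate measurement object; only the two compulsory fields are set at construction."""

    def __init__(self, sensor, timestamp):
        self.sensor = sensor
        self.timestamp = timestamp
        self._location = None  # a single point is the only location stored internally

    # "overloaded" setters: several input shapes, one internal representation
    def set_location(self, lat, lon):
        self._location = (lat, lon)

    def set_location_from_point(self, point):
        self._location = tuple(point)

    # one getter
    def get_location(self):
        return self._location


def push_states(data_store, states):
    """Once parsing is finished, loop over the collected States and push each to the database."""
    for state in states:
        data_store._add_to_sensors(state)  # mirrors the _add_to_sensors(state: State) call described above
```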

IanMayo commented 4 years ago

Ian's had a go at this pre-processing. See how this method's API expects a State2 object rather than discrete fields: https://github.com/debrief/pepys-import/pull/40/files#diff-81972b9b3de076b4b34c0be6a42c66b1R445

Also note that it returns location (lat, long) objects rather than providing separate get_lat() and get_long() methods.
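For example, a location value object of roughly this shape (purely illustrative, not the class used in the PR) would replace the pair of accessors:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    latitude: float
    longitude: float
```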

IanMayo commented 4 years ago

Closing. I have a working implementation that is satisfactory for this phase.