debrief / pepys-import

Support library for Pepys maritime data analysis environment
https://pepys-import.readthedocs.io/
Apache License 2.0

Design logic for parsing multiple files #4

Closed IanMayo closed 4 years ago

IanMayo commented 4 years ago

Library users will create importers, and register them with our library.

Then they will instruct our library to load a single file, or to recurse through a set of folders.

For each file, we will check with each importer whether it can handle that kind of file. To allow this, importers will either provide some metadata describing their capabilities, or implement a set of methods that let the library determine whether they can load that file.
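A minimal sketch of what that capability check could look like (the class and method names here are illustrative assumptions, not the library's actual API):

```python
from abc import ABC, abstractmethod


class Importer(ABC):
    """Illustrative base class: each registered importer describes what it can load."""

    @abstractmethod
    def can_load_this_type(self, suffix: str) -> bool:
        """Return True if this importer handles files with the given extension."""

    @abstractmethod
    def can_load_this_file(self, first_line: str) -> bool:
        """Return True if the start of the file looks like something this importer can parse."""

    @abstractmethod
    def load_this_file(self, path: str, data_store) -> None:
        """Parse the file and record its measurements in the data store."""
```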

Here is an initial set of tests:

This task is to produce a proposal (with pseudocode) for how to implement the above

The proposal will be discussed by the project team.

BarisSari commented 4 years ago

I checked the pseudocode in the Python Import Library document, and here are a few thoughts that came to mind:

IanMayo commented 4 years ago

So, they would import read_csv for CSV files and we wouldn't need to iterate over importers

They'll learn more about what they want once they get a chance to play with working software. But one scenario they expect is receiving a zip file with a range of file types inside it. They just want our lib to process all the files it can. That fits the model in the unit test: I register a number of importers, then point it at a folder of mixed data, and let it process it all.
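As a rough sketch of that scenario (the function name and signature are assumptions, building on the hypothetical Importer interface above, not the real entry points):

```python
import os


def process_folder(folder, importers, data_store):
    """Walk a folder of mixed data and offer each file to every registered importer."""
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            suffix = os.path.splitext(name)[1]
            with open(path, "r", errors="ignore") as f:
                first_line = f.readline()
            for importer in importers:
                # the first importer that accepts the file performs the load
                if importer.can_load_this_type(suffix) and importer.can_load_this_file(first_line):
                    importer.load_this_file(path, data_store)
                    break
```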

By the way, I think the most important thing about these importers would be preprocessing and getting the same format no matter their sources.

That kind of matches my expectations. We're going to have intermediate objects for measurements. This means:

Storing these files under DataStore's Datafile is still a question for me. As I said before, if we had proper preprocessing methods for all importers, we wouldn't need duplicated add methods.

I'm keen to understand this a bit more. Could you explain the preprocessing methods?

BarisSari commented 4 years ago

I'm keen to understand this a bit more. Could you explain the preprocessing methods?

I think I can give an example of a .rep file. In the database there is a location field, which is a string. But in .rep files, latitude and longitude are parsed and sent to _add_to_sensors_from_rep(..., lat, long, ...). If there were a preprocessing method that shaped the data to match the database table (let's assume it converts latitude and longitude to a single string called location), we could send it to _add_to_sensors(..., location, ...).

So, what I'm suggesting is having add_to_xxxx methods for each table which only accept fields that exist in the database, and making the intermediate objects conform to that shape before calling these methods.
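For illustration only, a sketch of that idea under assumed names (format_location, add_rep_state and add_to_sensors here are hypothetical, not the current DataStore API):

```python
def format_location(lat, lon):
    """Render parsed latitude/longitude as the single string stored in the location column."""
    return f"POINT({lon} {lat})"


def add_rep_state(data_store, sensor, timestamp, lat, lon):
    """Preprocess .rep-specific fields so only database-shaped values reach the generic add method."""
    data_store.add_to_sensors(sensor=sensor, timestamp=timestamp,
                              location=format_location(lat, lon))
```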

IanMayo commented 4 years ago

Aah, got you. It will actually be simpler than that :-)

DataFile.createState(sensor, timestamp) returns an intermediate State object: https://docs.google.com/document/d/1RW148XW4Iqr1mwEk5TezmgBf8z9zhZ-T88flk9stFiY/edit#heading=h.l84oqhpe01u4

A record in the States table only has two compulsory fields, and in the constructor we've provided them already.

The State object that gets returned can have overloaded setters:

But it will only store a point internally, and have one getter:

So, when we finally push data to the database we will just loop through those State objects and call _add_to_sensors(state: State).

So, yes, we're in agreement. We'll collate/tidy the data before we start interacting with the database. By storing measurements in these intermediate classes, we can forget about the actual parser used. We have an array of internally consistent State objects, and we can run some stats over them to look for erroneous data, statistical outliers, etc.
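A sketch of that intermediate object, using the names from this thread (the details are illustrative assumptions, not the final implementation): the constructor takes only the two compulsory fields, the setters accept more than one input shape, and a single getter exposes the internally stored point.

```python
class State:
    """Intermediate measurement object; only the two compulsory fields are set at construction."""

    def __init__(self, sensor, timestamp):
        self.sensor = sensor
        self.timestamp = timestamp
        self._location = None  # a single point is the only location stored internally

    # "overloaded" setters: several input shapes, one internal representation
    def set_location(self, lat, lon):
        self._location = (lat, lon)

    def set_location_from_point(self, point):
        self._location = tuple(point)

    # one getter
    def get_location(self):
        return self._location


def push_states(data_store, states):
    """Once parsing is finished, loop over the collected States and push each to the database."""
    for state in states:
        data_store._add_to_sensors(state)  # mirrors the _add_to_sensors(state: State) call described above
```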

IanMayo commented 4 years ago

Ian's had a go at this pre-processing. See how this method's API expects a State2 object rather than discrete fields: https://github.com/debrief/pepys-import/pull/40/files#diff-81972b9b3de076b4b34c0be6a42c66b1R445

Also note that it returns location (lat, long) objects rather than providing separate get_lat() and get_long() methods.
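For example, a location value object of roughly this shape (purely illustrative, not the class used in the PR) would replace the pair of accessors:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    latitude: float
    longitude: float
```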

IanMayo commented 4 years ago

Closing. I have a working implementation that is satisfactory for this phase.