Closed. IanMayo closed this issue 4 years ago.
I checked the pseudocode in the Python Import Library document, and here are a few thoughts that came to mind:
So, they would import read_csv for CSV files and we wouldn't need to iterate over importers
They'll learn more about what they want once they get a chance to play with working software. But one scenario they expect is that they get a zip file with a range of file types inside it. They just want our library to process all the files it can. That fits the model in the unit test: I register a number of importers, then point the library at a folder of mixed data, and let it process it all.
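The register-then-recurse workflow above could be sketched roughly as follows. This is only an illustration of the shape of the idea: FileProcessor, Importer, can_load, and load are placeholder names I've made up here, not the library's actual API.

```python
# Illustrative sketch only: FileProcessor, Importer, RepImporter, can_load
# and load are assumed names, not the real pepys-import API.
import os


class Importer:
    """Base class: each importer declares which files it can handle."""

    def can_load(self, path):
        raise NotImplementedError

    def load(self, path):
        raise NotImplementedError


class RepImporter(Importer):
    """Toy importer that claims .rep files."""

    def can_load(self, path):
        return path.endswith(".rep")

    def load(self, path):
        return f"parsed REP file: {os.path.basename(path)}"


class FileProcessor:
    """Users register importers, then point the processor at a folder."""

    def __init__(self):
        self.importers = []

    def register(self, importer):
        self.importers.append(importer)

    def process_folder(self, folder):
        # Offer every file to every importer; silently skip files
        # that no registered importer can handle (the mixed-zip case).
        results = []
        for root, _dirs, files in os.walk(folder):
            for name in sorted(files):
                path = os.path.join(root, name)
                for importer in self.importers:
                    if importer.can_load(path):
                        results.append(importer.load(path))
        return results
```

The key design point is that unknown file types are ignored rather than raising errors, so a folder of mixed data "just works".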
By the way, I think the most important thing about these importers will be preprocessing: producing the same format no matter the source.
That kind of matches my expectations. We're going to have intermediate objects for measurements. This means:
- We can add a Validate step once the parser(s) are complete, where we offer the analyst a graph of the loaded data before it goes to the database.
- The data goes to the database in one big push, which should be more efficient.

Storing these files under DataStore's Datafile is still a question for me. As I said before, if we had proper preprocessing methods for all importers, we wouldn't need duplicated add methods.
I'm keen to understand this a bit more. Could you explain the preprocessing methods?
I think I can give an example from a .rep file. In the database, there is a location field which is a string. But in .rep files, latitude and longitude are parsed and sent to _add_to_sensors_fromrep(...,lat, long,..). If there were a preprocessing method that makes the data fit the database table (let's assume it converts latitude and longitude to a string called location), we could send it to _add_tosensors(..., location,...).
So, what I'm suggesting is having add_to_xxxx methods for each table, which only accept fields that exist in the database, and making the intermediate objects conform before calling these methods.
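The lat/long example above could look something like this. Here _add_to_sensors is a stand-in for the generic add method being proposed, to_location is a hypothetical preprocessing helper, and the "POINT(lon lat)" string format is an assumption for illustration:

```python
# Hypothetical sketch: to_location and _add_to_sensors are illustrative
# names, and the location string format is an assumption.
def to_location(lat, lon):
    """Preprocess discrete latitude/longitude into the single string
    the database's location column expects (assumed WKT-style here)."""
    return f"POINT({lon} {lat})"


def _add_to_sensors(location, **fields):
    """Generic add method: accepts only values that map to database
    columns, regardless of which importer produced them."""
    row = {"location": location, **fields}
    return row


# Instead of a format-specific _add_to_sensors_fromrep(..., lat, long, ...),
# every importer preprocesses first, then calls the same generic method:
row = _add_to_sensors(to_location(50.5, -1.2), sensor="GPS")
```

This way the format-specific knowledge lives in the preprocessing step, and only one add method per table is needed.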
Aah, got you. It will actually be simpler than that :-)
DataFile.createState(sensor, timestamp) returns an intermediate State object:
https://docs.google.com/document/d/1RW148XW4Iqr1mwEk5TezmgBf8z9zhZ-T88flk9stFiY/edit#heading=h.l84oqhpe01u4
A record in the States table only has two compulsory fields, and in the constructor we've provided them already.
The State object that gets returned can have overloaded setters, but it will only store a point internally, and have one getter.
So, when we finally push data to the database we will just loop through those State objects, and call _add_tosensors(state:State).
So - yes, we're in agreement. We'll collate/tidy the data before we start interacting with the database. By storing measurements in these intermediate classes, we can forget about the actual parser used. We have an array of internally consistent State objects, and we can run checks over them to look for erroneous data, statistical outliers, etc.
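The "validate, then big push" flow described above could be sketched like this. The validate rule and both function names are illustrative stand-ins; here the states are plain dicts for brevity:

```python
# Minimal sketch of the final push: check the collated records, then
# hand each survivor to the database-facing add method. validate and
# push_to_database are assumed names, not the real API.
def validate(states):
    """Drop obviously erroneous records, e.g. impossible latitudes.
    A real implementation might flag them for the analyst instead."""
    return [s for s in states if -90 <= s["lat"] <= 90]


def push_to_database(states, add_method):
    """Loop through the validated states and push each one in turn;
    returns the number of records actually pushed."""
    good = validate(states)
    for state in good:
        add_method(state)
    return len(good)
```

Because validation happens over the whole in-memory collection before any database interaction, the analyst can review (or graph) the data first, and the push itself is one tight loop.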
Ian's had a go at this pre-processing. See how this method API expects a State2 object, rather than discrete fields:
https://github.com/debrief/pepys-import/pull/40/files#diff-81972b9b3de076b4b34c0be6a42c66b1R445
Also note that it returns location (lat, long) objects rather than separate get_lat() and get_long() methods.
Closing. I have a working implementation that is satisfactory for this phase.
Library users will create importers, and register them with our library.
Then they will instruct our library to load a single file, or to recurse through a set of folders.
For each file, we will check with each importer whether it can handle that kind of file. To allow this, importers will either provide some metadata describing their capabilities, or implement a set of methods that let the library determine whether they can load that file.
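One way to structure those capability checks is as a series of increasingly expensive questions, so an importer can cheaply reject a file by extension or filename before anything reads the file's contents. The method names below are assumptions for illustration, not a confirmed interface:

```python
# Sketch of layered capability checks, under assumed method names.
class CsvImporter:
    """Toy importer exposing cheap-to-expensive capability checks."""

    def can_load_this_type(self, suffix):
        # Cheapest check: file extension only.
        return suffix.lower() == ".csv"

    def can_load_this_filename(self, filename):
        # Still cheap: e.g. skip editor backup/temp files.
        return not filename.startswith("~")

    def can_load_this_header(self, first_line):
        # Requires reading the file's first line.
        return "," in first_line


def importer_handles(importer, suffix, filename, first_line):
    """Run the checks in order, cheapest first, short-circuiting
    as soon as any check fails."""
    return (
        importer.can_load_this_type(suffix)
        and importer.can_load_this_filename(filename)
        and importer.can_load_this_header(first_line)
    )
```

With this shape, scanning a large mixed folder stays fast: most files are rejected without ever being opened.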
Here is an initial set of tests:
;;DEBRIEF DATA
)SERIAL 1 Data:
This task is to produce a proposal (with pseudocode) for how to implement the above.
The proposal will be discussed by the project team.