For the moment, the code simply reads files from a local file-system directory named "resources". Because there are multiple copies of this directory, there is some redundancy and room for error. Other code in the foundry has used datasets containing multiple files, and some containing a single file.
A reasonable guess is that the team working on fetching the files will deliver them as datasets in HDFS. What matters is that we agree on a reasonable definition of batches, since this affects both the pre-OMOP and post-OMOP DQ work: we need to be able to compare the same batch of documents on either side. A first step in that direction is making sure both sides use the same dataset of files.
A further step would be either aggregating them or maintaining a list of the full collection of documents.
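One way to check that both sides are working from the same batch is to build a manifest of the documents and compare it across the pre-OMOP and post-OMOP stages. This is only a sketch under assumptions not settled in the notes above: it assumes each batch lives in a directory of files (e.g. our local "resources" copy), and the function names (`build_manifest`, `same_batch`) are hypothetical.

```python
import hashlib
from pathlib import Path

def build_manifest(batch_dir: str) -> dict[str, str]:
    """Map each document's relative path to its SHA-256 digest.

    Comparing manifests from the pre-OMOP and post-OMOP sides
    tells us whether both processed the same batch of documents.
    """
    manifest = {}
    root = Path(batch_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            manifest[str(path.relative_to(root))] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    return manifest

def same_batch(dir_a: str, dir_b: str) -> bool:
    """True if the two directories hold byte-identical document sets."""
    return build_manifest(dir_a) == build_manifest(dir_b)
```

The same manifest could later serve as the "list of the collection of all documents" mentioned above, regardless of whether the batch ends up as one HDFS dataset or several.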
TODO: for starters, as the Palantir folks figure this out, get with them and organize our test data from the resources directory in the same way.