This issue tracks the work required to deliver the file-based optimization laid out in the design for a small subset of studies (phase 1), including the mega-study for MapVeu.
## Task List

_Include any links to related issues here if present._
- [x] Dan - Generate 12 fake binary files, to scale (simulating 4 variables, 4 filters, and 4 ancestor files)
- [x] Dan - Write a parallel processor that reads those files and emits a stream of IDs (VEuPathDB/lib-eda-subsetting#1)
- [x] Benchmark, with and without cache (implement one group)
- [x] Dan - Handle multiple variables in tabular output
- [x] Ryan - Factor subsetting logic into a repo
- [x] Ryan - Update the subsetting service to use the new repo
- [x] Ryan/Steve - Create a file-creation tool for all known variable types
- [x] Ryan/Steve - Dump a study
- [x] Implement yellow merging nodes
  - [x] Expanders - Take ancestor identifiers and expand them into children using the descendant's ancestor file
  - [x] Reducers - Take descendant identifiers and collapse them into ancestor IDs. This is an "or" operation: if any of a node's descendants is present, that node is included.
  - [x] Output - Output the values for a set of identifiers
  - [x] Output - Include IDs in the output
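The expander/reducer items above can be sketched roughly as follows. This is a minimal in-memory illustration assuming a simple descendant-to-ancestor map; the real implementation streams sorted binary ancestor files, and all class and method names here are hypothetical.

```java
import java.util.*;
import java.util.stream.*;

// Minimal in-memory sketch of the expander/reducer ideas above.
// Hypothetical names; the real code streams sorted binary ancestor files.
public class AncestorMapping {

  // descendant ID -> ancestor ID (one level up), as read from an ancestor file
  private final Map<String, String> ancestorOf;

  public AncestorMapping(Map<String, String> ancestorOf) {
    this.ancestorOf = ancestorOf;
  }

  // Expander: take ancestor identifiers and expand them into children.
  public List<String> expand(Set<String> ancestorIds) {
    return ancestorOf.entrySet().stream()
        .filter(e -> ancestorIds.contains(e.getValue()))
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }

  // Reducer: collapse descendant identifiers into ancestor IDs with "or"
  // semantics: an ancestor is included if ANY of its descendants is present.
  public Set<String> reduce(Stream<String> descendantIds) {
    return descendantIds
        .map(ancestorOf::get)
        .filter(Objects::nonNull)
        .collect(Collectors.toCollection(TreeSet::new));
  }
}
```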
- [x] Dan - Model the tree in Java
- [x] Dan - Manually create a tree using the real study
- [x] Dan - Write a tree processor
- [x] Dan - Write a tree generator
- [x] Dan - Add a subsetting parameter for using the file-based solution
- [x] Add code in the primary tabular API to branch based on study metadata
- [ ] Data formatting
  - [x] Model dates as longs throughout the entire map-reduce workflow
  - [ ] Transition from Long and Double to Integer and Float (optional?)
  - [x] Handle multi-value variables
    - [x] Discuss design: tall vs. wide
    - [x] Ensure the file dumper writes multi-value variables in the agreed format
    - [x] Ensure the reader handles multi-value variables in the agreed format
  - [ ] Write vocabulary files, and have the file dumper write vocab variable values as references into those files (moved to phase 2)
  - [x] Add ancestors to the ID mapping file
  - [x] Output headers
  - [x] Allocate ID length based on the longest ID in the entity
  - [ ] Allocate String length based on the longest String in the variable (moved to phase 2)
  - [x] Handle multi-filters
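Two of the formatting items above (dates modeled as longs, ID width allocated from the longest ID in the entity) can be illustrated with a small fixed-width record codec. This is a hypothetical sketch, not the actual file format: it pads each ID with zero bytes to a fixed width and stores the date as an epoch-millisecond long.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a fixed-width binary record: an ID padded to the
// width of the longest ID in the entity, followed by a date stored as an
// epoch-millisecond long. Not the real file format.
public class FixedWidthRecord {

  // Pad the ID with zero bytes to the allocated width, then append the date.
  public static byte[] encode(String id, long dateMillis, int idWidth) {
    byte[] raw = id.getBytes(StandardCharsets.UTF_8);
    if (raw.length > idWidth)
      throw new IllegalArgumentException("ID exceeds allocated width");
    ByteBuffer buf = ByteBuffer.allocate(idWidth + Long.BYTES);
    buf.put(raw);           // ID bytes; the rest of the ID slot stays zero
    buf.position(idWidth);  // skip over the zero padding
    buf.putLong(dateMillis);
    return buf.array();
  }

  // Read the ID back, trimming the zero-byte padding.
  public static String decodeId(byte[] record, int idWidth) {
    int end = idWidth;
    while (end > 0 && record[end - 1] == 0) end--;
    return new String(record, 0, end, StandardCharsets.UTF_8);
  }

  public static long decodeDate(byte[] record, int idWidth) {
    return ByteBuffer.wrap(record).getLong(idWidth);
  }
}
```

Fixed-width records are what make positional reads possible: record *n* starts at byte `n * (idWidth + 8)` with no scanning.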
- [ ] Testing
  - [x] Regression testing
    - [x] Write a script that scrapes the logs, saving requests
    - [x] Write regression tests
    - [x] Run tests on a subset of requests
    - [x] Run tests on all requests and aggregate the results
  - [ ] Performance testing
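The regression-test items above boil down to replaying scraped requests against both subsetting paths and diffing the tabular output. A toy sketch of the comparison step (hypothetical helper, not the actual test harness):

```java
import java.util.*;

// Hypothetical sketch of the regression-comparison step: given the tabular
// output of the old (database) and new (file-based) subsetting paths for the
// same scraped request, report the first mismatching line, if any.
public class TabularDiff {

  // Returns -1 if the outputs match, else the index of the first
  // mismatching line (or of the first extra line in the longer output).
  public static int firstMismatch(List<String> oldLines, List<String> newLines) {
    int shared = Math.min(oldLines.size(), newLines.size());
    for (int i = 0; i < shared; i++) {
      if (!oldLines.get(i).equals(newLines.get(i))) return i;
    }
    return oldLines.size() == newLines.size() ? -1 : shared;
  }
}
```

Aggregating results across all scraped requests is then just counting how many requests return -1.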
- [ ] Optimization
  - [ ] Dump a separate file with formatted dates, to avoid the slowdown of formatting dates while subsetting (moved to phase 2)
  - [x] Change DualBufferBinaryRecordReader to take a function that pops a record off of the ByteBuffer instead of copying the contents to a new byte array
  - [ ] Review the efficiency of the dumper
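The DualBufferBinaryRecordReader change above (passing a function that pops a record straight off the ByteBuffer instead of copying bytes into a new array first) can be sketched like this. Class and method names are hypothetical, not the actual lib-eda-subsetting API.

```java
import java.nio.ByteBuffer;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the "pop a record off the ByteBuffer" idea: the
// reader hands its buffer to a deserializer function instead of copying
// each record into a fresh byte[] first. Illustrative names only.
public class BufferReader<T> {

  private final ByteBuffer buffer;
  private final int recordWidth;
  private final Function<ByteBuffer, T> deserializer;

  public BufferReader(ByteBuffer buffer, int recordWidth,
                      Function<ByteBuffer, T> deserializer) {
    this.buffer = buffer;
    this.recordWidth = recordWidth;
    this.deserializer = deserializer;
  }

  // Reads the next record in place; the deserializer advances the buffer
  // position itself, so no intermediate byte[] is allocated per record.
  public Optional<T> next() {
    if (buffer.remaining() < recordWidth) return Optional.empty();
    return Optional.of(deserializer.apply(buffer));
  }
}
```

The point of the change is allocation pressure: reading in place avoids one `byte[]` copy per record, which matters when streaming millions of records.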
- [ ] Bugs
  - [x] Requesting a variable while specifying the incorrect entity in the URL yields unexpected results