This issue tracks the work required to deliver the file-based optimization laid out in the design for a small subset of studies (phase 1), including the mega-study for MapVeu.
## Task List

_Include any links to related issues here if present._
- [x] Dan - Generate 12 fake binary files, to scale (simulating 4 variables, 4 filters, and 4 ancestor files)
- [x] Dan - Write a parallel processor that reads those files and emits a stream of IDs (VEuPathDB/lib-eda-subsetting#1)
- [x] Benchmark, with and without cache (implement one group)
- [x] Dan - Handle multiple variables in tabular output
- [x] Ryan - Factor subsetting logic into a repo
- [x] Ryan - Update the subsetting service to use the new repo
- [x] Ryan/Steve - Create a file-creation tool for all known variable types
- [x] Ryan/Steve - Dump a study
- [x] Implement yellow merging nodes
  - [x] Expanders - Take ancestor identifiers and expand them into children using the descendant's ancestor file
  - [x] Reducers - Take descendant identifiers and collapse them into ancestor IDs. This is an "or" operation: if any of a node's descendants is present, that node is included.
  - [x] Output - Output the values for a set of identifiers
  - [x] Output - Include IDs in the output
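The expander/reducer items above can be sketched roughly as follows. This is a minimal in-memory illustration assuming a simple descendant-to-ancestor map; the real implementation streams sorted binary ancestor files, and all class and method names here are hypothetical.

```java
import java.util.*;
import java.util.stream.*;

// Minimal in-memory sketch of the expander/reducer ideas above.
// Hypothetical names; the real code streams sorted binary ancestor files.
public class AncestorMapping {

  // descendant ID -> ancestor ID (one level up), as read from an ancestor file
  private final Map<String, String> ancestorOf;

  public AncestorMapping(Map<String, String> ancestorOf) {
    this.ancestorOf = ancestorOf;
  }

  // Expander: take ancestor identifiers and expand them into children.
  public List<String> expand(Set<String> ancestorIds) {
    return ancestorOf.entrySet().stream()
        .filter(e -> ancestorIds.contains(e.getValue()))
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }

  // Reducer: collapse descendant identifiers into ancestor IDs with "or"
  // semantics: an ancestor is included if ANY of its descendants is present.
  public Set<String> reduce(Stream<String> descendantIds) {
    return descendantIds
        .map(ancestorOf::get)
        .filter(Objects::nonNull)
        .collect(Collectors.toCollection(TreeSet::new));
  }
}
```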
- [x] Dan - Model the tree in Java
- [x] Dan - Manually create a tree using the real study
- [x] Dan - Write a tree processor
- [x] Dan - Write a tree generator
- [x] Dan - Add a subsetting parameter for using the file-based solution
- [x] Add code in the primary tabular API to branch based on study metadata
- [ ] Data formatting
  - [x] Model dates as longs throughout the entire map-reduce workflow
  - [ ] Transition from Long and Double to Integer and Float (optional?)
  - [x] Handle multi-value variables
    - [x] Discuss design: tall vs. wide
    - [x] Ensure the file dumper writes multi-value variables in the agreed format
    - [x] Ensure the reader handles multi-value variables in the agreed format
  - [ ] Write vocabulary files, and have the file dumper write vocab variable values as references into those files (moved to phase 2)
  - [x] Add ancestors to the ID mapping file
  - [x] Output headers
  - [x] Allocate ID length based on the longest ID in the entity
  - [ ] Allocate String length based on the longest String in the variable (moved to phase 2)
  - [x] Handle multi-filters
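Two of the formatting items above (dates modeled as longs, ID width allocated from the longest ID in the entity) can be illustrated with a small fixed-width record codec. This is a hypothetical sketch, not the actual file format: it pads each ID with zero bytes to a fixed width and stores the date as an epoch-millisecond long.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a fixed-width binary record: an ID padded to the
// width of the longest ID in the entity, followed by a date stored as an
// epoch-millisecond long. Not the real file format.
public class FixedWidthRecord {

  // Pad the ID with zero bytes to the allocated width, then append the date.
  public static byte[] encode(String id, long dateMillis, int idWidth) {
    byte[] raw = id.getBytes(StandardCharsets.UTF_8);
    if (raw.length > idWidth)
      throw new IllegalArgumentException("ID exceeds allocated width");
    ByteBuffer buf = ByteBuffer.allocate(idWidth + Long.BYTES);
    buf.put(raw);           // ID bytes; the rest of the ID slot stays zero
    buf.position(idWidth);  // skip over the zero padding
    buf.putLong(dateMillis);
    return buf.array();
  }

  // Read the ID back, trimming the zero-byte padding.
  public static String decodeId(byte[] record, int idWidth) {
    int end = idWidth;
    while (end > 0 && record[end - 1] == 0) end--;
    return new String(record, 0, end, StandardCharsets.UTF_8);
  }

  public static long decodeDate(byte[] record, int idWidth) {
    return ByteBuffer.wrap(record).getLong(idWidth);
  }
}
```

Fixed-width records are what make positional reads possible: record *n* starts at byte `n * (idWidth + 8)` with no scanning.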
- [ ] Testing
  - [x] Regression testing
    - [x] Write a script that scrapes the logs, saving requests
    - [x] Write regression tests
    - [x] Run tests on a subset of requests
    - [x] Run tests on all requests and aggregate the results
  - [ ] Performance testing
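The regression-test items above boil down to replaying scraped requests against both subsetting paths and diffing the tabular output. A toy sketch of the comparison step (hypothetical helper, not the actual test harness):

```java
import java.util.*;

// Hypothetical sketch of the regression-comparison step: given the tabular
// output of the old (database) and new (file-based) subsetting paths for the
// same scraped request, report the first mismatching line, if any.
public class TabularDiff {

  // Returns -1 if the outputs match, else the index of the first
  // mismatching line (or of the first extra line in the longer output).
  public static int firstMismatch(List<String> oldLines, List<String> newLines) {
    int shared = Math.min(oldLines.size(), newLines.size());
    for (int i = 0; i < shared; i++) {
      if (!oldLines.get(i).equals(newLines.get(i))) return i;
    }
    return oldLines.size() == newLines.size() ? -1 : shared;
  }
}
```

Aggregating results across all scraped requests is then just counting how many requests return -1.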
- [ ] Optimization
  - [ ] Dump a separate file with formatted dates, to avoid the slowdown of formatting dates while subsetting (moved to phase 2)
  - [x] Change DualBufferBinaryRecordReader to take a function that pops a record off of the ByteBuffer instead of copying the contents to a new byte array
  - [ ] Review the efficiency of the dumper
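The DualBufferBinaryRecordReader change above (passing a function that pops a record straight off the ByteBuffer instead of copying bytes into a new array first) can be sketched like this. Class and method names are hypothetical, not the actual lib-eda-subsetting API.

```java
import java.nio.ByteBuffer;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the "pop a record off the ByteBuffer" idea: the
// reader hands its buffer to a deserializer function instead of copying
// each record into a fresh byte[] first. Illustrative names only.
public class BufferReader<T> {

  private final ByteBuffer buffer;
  private final int recordWidth;
  private final Function<ByteBuffer, T> deserializer;

  public BufferReader(ByteBuffer buffer, int recordWidth,
                      Function<ByteBuffer, T> deserializer) {
    this.buffer = buffer;
    this.recordWidth = recordWidth;
    this.deserializer = deserializer;
  }

  // Reads the next record in place; the deserializer advances the buffer
  // position itself, so no intermediate byte[] is allocated per record.
  public Optional<T> next() {
    if (buffer.remaining() < recordWidth) return Optional.empty();
    return Optional.of(deserializer.apply(buffer));
  }
}
```

The point of the change is allocation pressure: reading in place avoids one `byte[]` copy per record, which matters when streaming millions of records.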
- [ ] Bugs
  - [x] Requesting a variable while specifying the incorrect entity in the URL yields unexpected results