Restore local database for extracted features

ArtPoon commented 1 year ago

We've been burning a lot of CPU re-aligning millions of sequences every time we process a new provision file. Even though this is pretty fast with minimap2, there is a lot of sequences! For the sake of our hardware, we should think about bringing back the database scheme implemented by @ewong347 ages ago. This is the general idea:

while iterating through the lzma-decompressed JSON data provision stream, query each EPI accession against the database.
If it is in the database, continue to the next record.
if it is not, do pairwise alignment to reference via minimap2 as usual, extract features and store in the database along with the metadata we retain from the JSON
since the resulting database will be enormous, we might need to merge sequence records with identical feature vectors (which will probably mean reworking Emmanuel's schema)

This should be implemented in an experimental branch

SandeepThokala commented 1 year ago

Created a temporary gsaid.db file for populating a SEQUENCES table every time a new sequence is given. Updated code in a new branch iss485 Working on merging sequence records with identical feature vectors.

GopiGugan commented 1 year ago

@SandeepThokala to update this issue with the current database schema (tables, fields, etc.)

SandeepThokala commented 1 year ago

Database schema for now:

sequences
- accession (primary key)
- name
- lineage
- date
- location
- sequence
features
- accession
- name
- lineage
- date
- location
- vectors (primary key)

features table is a result of merging records with identical feature vectors from sequences table. For example if there are two records with identical feature vectors, then the merged record in features table would look like..

accession	name	location	lineage	vectors (pk)
accession1,accession2	name1,name2	location1,location2	lineage1,lineage2	fvecs string

ArtPoon commented 1 year ago

I don't think we should store raw sequences in the database (sequences table)
we should have a separate table linking members of a "variant" (a set of sequences with the same feature vector) to the variant ID - alternatively we could store a key to the feature vector in the sequences table

SandeepThokala commented 1 year ago

Facing the following error, if a the same test file is used for a second time.

File "./covizu/batch.py", line 153, in <module>
    timetree, residuals = build_timetree(by_lineage, args, cb.callback)
  File "./covizu/covizu/utils/batch_utils.py", line 87, in build_timetree
    return covizu.treetime.parse_nexus(nexus_file, fasta)
  File "./covizu/covizu/treetime.py", line 160, in parse_nexus
    rmean = statistics.mean(rvals)
  File "/usr/lib/python3.10/statistics.py", line 328, in mean
    raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point

GopiGugan commented 1 year ago

Facing the following error, if a the same test file is used for a second time.

@SandeepThokala - can you please push your updates to the iss485 branch so we can investigate this issue further?

ArtPoon commented 1 year ago

reprocessing the exact same input data feed is an edge case (we almost always expect there to be new data in production) so we should just have some exception handling for this situation

ArtPoon commented 1 year ago

@SandeepThokala when you have the exception handling in place, can you please post some timing results on (1) processing a data feed de novo (i.e., with an empty database), versus (2) reprocessing the same data feed with a full database. In other words, we need to measure what the overhead cost of querying the database for sample identifiers is.

SandeepThokala commented 1 year ago

Test data used has a total of 2001 records. Time taken with an empty database: 3 mins 31 secs

🏄 [0:03:31.892925] All done!

Time taken with a full database: 7 secs

🏄 [0:00:07.021282]

ArtPoon commented 1 year ago

@SandeepThokala reports that building the database took about 3 minutes for 2,000 sequences on his machine. Extrapolating that out to 10 million records (factor of 5000), that would be roughly 250 hours or about 10 days. We also need to consider the size of the database.
@SandeepThokala can you report on:

additional, more precise timing of building database on Paphlagon
measure size of database for a given number of records

Would there be any advantage (speed or space efficiency) in switching from sqlite to a more advanced database system like postgres?

ArtPoon commented 11 months ago

use nohup <cmd> & to run a process in the background so that if your shell is logged out the process will continue to run
use UNIX time <cmd> to measure computing time precisely

SandeepThokala commented 11 months ago

Timing results on Paphlagon using postgresql for a dataset of 1 million sequences Time taken: 4hrs 50mins

🏄 [0:00:00.754457] Processing GISAID feed data
🏄 [0:00:03.848454] aligned 0 records
🏄 [0:01:48.123322] aligned 10000 records
🏄 [0:03:36.460588] aligned 20000 records
..
..
🏄 [1:18:56.193301] aligned 600000 records
🏄 [1:19:57.546790] aligned 610000 records
🏄 [1:19:57.844676] filtered 758998 problematic features
🏄 [1:19:57.844722]          292531 genomes with excess missing sites
🏄 [1:19:57.844734]          45545 genomes with excess divergence
🏄 [1:19:57.844926] Parsing Pango lineage designations
🏄 [1:19:59.502800] Identifying lineage representative genomes
🏄 [1:20:09.876231] Reconstructing tree with fasttree2
..
..
--- divergence tree saved in nexus format as  
     data/divergence_tree.nexus

🏄 [1:34:08.394258] Recoding features, compressing variants..
🏄 [1:34:27.818366] start MPI on minor lineages
🏄 [1:34:32.566257][100/128] starting AY.4.8
🏄 [1:34:32.598005][96/128] starting B.1.311
..
..
🏄 [3:54:31.695501] start AY.3, 5719 entries
🏄 [3:58:41.902696] Parsing output files
R[write to console]: Error in Edges[, 1] : incorrect number of dimensions
..
..
R[write to console]: Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)

R[write to console]: Error in optimx.check(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower,  : 
  Function returned is infinite or NA (non-computable)

🏄 [4:50:48.659124] All done!

ArtPoon commented 11 months ago

thanks @SandeepThokala can you also post timing for rerunning the same data feed with the complete database? Also is it possible to compare output data files from both versions?

SandeepThokala commented 11 months ago

Timing results for rerunning the 1 million dataset with complete database

(covizu) [sandeep@Paphlagon covizu]$ time python3 batch.py --infile provision.1m.json.xz --dry-run
🏄 [0:00:00.819510] Processing GISAID feed data
🏄 [0:03:59.062307] filtered 0 problematic features
🏄 [0:03:59.062413]          0 genomes with excess missing sites
...
...
0.00    -TreeAnc: set-up
TreeAnc: tree in /tmp/cvz_tt_fv16_7wj as only 1 tips. Please check your tree!
Tree loading/building failed.
No tree -- exiting.
🏄 [0:04:02.186036] Provided records are already stored.
🏄 [0:04:02.279103] Recoding features, compressing variants..
🏄 [0:04:02.279667] start MPI on minor lineages
🏄 [0:04:06.207955] Parsing output files
🏄 [0:04:08.992934] All done!

real    4m10.655s
user    3m47.634s
sys     5m31.654s

When comparing the clusters.json output file before and after the run, there were no changes detected in the file.

ArtPoon commented 11 months ago

Thanks for posting this timing info. Something strange is going on - the treetime step is complaining about only having a single tip in the tree, which implies that nearly all records are being discarded.
Why are the filters reading zero? Are you checking inputs from the data feed with records in the database?

ArtPoon commented 11 months ago

@GopiGugan investigating nosql database to reduce the number of transactions, i.e., insert operations to multiple tables per record

ArtPoon commented 11 months ago

Once past the initial step of updating the database with new records from the provision feed, we should be streaming JSON from the database with {name, diffs, missing}, with one row per record, to the next stage of the pipeline - i.e., the output from extract_features

ArtPoon commented 11 months ago

@GopiGugan estimates the NoSQL database will require about 15GB to store a JSON of 13 million records.

SandeepThokala commented 10 months ago

Timing results on Paphlagon using postgresql

No. of records	Time (Empty)	Time (Full)
2,000	2m 49s	2m 46s
10,000	8m 00s	8m 01s
100,000	2h 48m 02s	3h 07m 14s

For the infile containing 100,000 sequences, the run took longer with full database. I think this is because of the way I'm storing diffs and missing, which involves JSON strings. https://github.com/PoonLab/covizu/blob/a1c5f3bc0e563281727104a03d484a0d39ec6329/covizu/utils/gisaid_utils.py#L172-L175

ArtPoon commented 10 months ago

The above timings represent running the entire analysis (extracting feature vectors, calculating distances and inferring clusters). We need a breakdown of the time consumed for different stages of the analysis, especially the feature extract (sequence alignment) stage that having a local database is supposed to streamline. @SandeepThokala can you please re-run these tests and measure the time required for each stage, or at least the first step? It might be easier to just run the first step.

ArtPoon commented 10 months ago

If you have saved the console output with timing messages then you should be able to figure out how long it takes to get the step where features have been extracted from aligned genomes, compressed into unique vectors, and sorted by lineage.

SandeepThokala commented 10 months ago

For an infile containing 2,000 records: Console output with empty database. Time taken to get the step where features have been extracted: 04.18s

🏄 [0:00:00.833155] Processing GISAID feed data
🏄 [0:00:03.705643] aligned 0 records
🏄 [0:00:04.183092] filtered 1066 problematic features
🏄 [0:00:04.183195]          671 genomes with excess missing sites
🏄 [0:00:04.183209]          163 genomes with excess divergence
🏄 [0:00:04.183306] Parsing Pango lineage designations
🏄 [0:00:05.930795] Identifying lineage representative genomes
🏄 [0:00:06.017177] Reconstructing tree with fasttree2

Console output with full database. Time taken to get the step where features have been extracted: 02.92s

🏄 [0:00:00.918631] Processing GISAID feed data
🏄 [0:00:02.674425] aligned 0 records
🏄 [0:00:02.926458] filtered 1066 problematic features
🏄 [0:00:02.926625]          671 genomes with excess missing sites
🏄 [0:00:02.926641]          163 genomes with excess divergence
🏄 [0:00:02.926736] Parsing Pango lineage designations
🏄 [0:00:04.859742] Identifying lineage representative genomes
🏄 [0:00:04.926775] Reconstructing tree with fasttree2

SandeepThokala commented 10 months ago

For an infile containing 10,000 records: Console output with empty database. Time taken to get the step where features have been extracted: 21.24s

🏄 [0:00:00.903292] Processing GISAID feed data
🏄 [0:00:03.725789] aligned 0 records
🏄 [0:00:21.246651] filtered 4387 problematic features
🏄 [0:00:21.247484]          5503 genomes with excess missing sites
🏄 [0:00:21.247500]          424 genomes with excess divergence
🏄 [0:00:21.247601] Parsing Pango lineage designations
🏄 [0:00:23.049981] Identifying lineage representative genomes
🏄 [0:00:23.216430] Reconstructing tree with fasttree2

Console output with full database. Time taken to get the step where features have been extracted: 17.76s

🏄 [0:00:00.923452] Processing GISAID feed data
🏄 [0:00:04.967376] aligned 0 records
🏄 [0:00:17.769201] filtered 4387 problematic features
🏄 [0:00:17.769395]          5503 genomes with excess missing sites
🏄 [0:00:17.769414]          424 genomes with excess divergence
🏄 [0:00:17.769542] Parsing Pango lineage designations
🏄 [0:00:19.787453] Identifying lineage representative genomes
🏄 [0:00:19.936859] Reconstructing tree with fasttree2

SandeepThokala commented 10 months ago

For an infile containing 100,000 records: Console output with empty database. Time taken to get the step where features have been extracted: 11m 21.94s

🏄 [0:00:00.891205] Processing GISAID feed data
🏄 [0:00:03.464269] aligned 0 records
🏄 [0:01:00.020073] aligned 10000 records
🏄 [0:02:34.053302] aligned 20000 records
🏄 [0:04:01.141220] aligned 30000 records
🏄 [0:06:13.659268] aligned 40000 records
🏄 [0:09:36.927096] aligned 50000 records
🏄 [0:11:21.948412] filtered 68125 problematic features
🏄 [0:11:21.948540]          38432 genomes with excess missing sites
🏄 [0:11:21.948553]          6358 genomes with excess divergence
🏄 [0:11:21.948665] Parsing Pango lineage designations
🏄 [0:11:23.708734] Identifying lineage representative genomes
🏄 [0:11:25.330877] Reconstructing tree with fasttree2

Console output with full database. Time taken to get the step where features have been extracted: 29m 35.53s

🏄 [0:00:00.760735] Processing GISAID feed data
🏄 [0:00:34.837885] aligned 0 records
🏄 [0:06:27.779398] aligned 10000 records
🏄 [0:12:10.327561] aligned 20000 records
🏄 [0:16:20.266935] aligned 30000 records
🏄 [0:21:52.947085] aligned 40000 records
🏄 [0:27:49.226798] aligned 50000 records
🏄 [0:29:35.536452] filtered 68125 problematic features
🏄 [0:29:35.536590]          38432 genomes with excess missing sites
🏄 [0:29:35.536609]          6358 genomes with excess divergence
🏄 [0:29:35.536694] Parsing Pango lineage designations
🏄 [0:29:37.601106] Identifying lineage representative genomes
🏄 [0:29:39.239793] Reconstructing tree with fasttree2

SandeepThokala commented 10 months ago

Trying to get results for 1 million sequences, on empty database. still running after 8h 47m

🏄 [8:47:48.732020] aligned 300000 records

ArtPoon commented 10 months ago

@SandeepThokala can you make a plot summarizing your timing results (time versus number of sequences)? Also I need you to analyze why re-running from a full database is not faster.

ArtPoon commented 10 months ago

@GopiGugan to repeat experiment with MongoDB (NOSQL) instead of SQL to see if it is faster

ArtPoon commented 10 months ago

@GopiGugan what branch is this database work being done on? I might need to use it to work on #493

GopiGugan commented 10 months ago

@GopiGugan what branch is this database work being done on? I might need to use it to work on #493

We're using the iss485 and iss485-nosql branches for development

SandeepThokala commented 10 months ago

@SandeepThokala can you make a plot summarizing your timing results (time versus number of sequences)? Also I need you to analyze why re-running from a full database is not faster.

covizu-plot

Currently working on analyzing why re-run from full database is not faster

SandeepThokala commented 10 months ago

Sequentially searching for every record using virus name (Line 120), I think, is causing the delay. https://github.com/PoonLab/covizu/blob/a1c5f3bc0e563281727104a03d484a0d39ec6329/covizu/utils/gisaid_utils.py#L116-L121

So, I tried adding an index on the qname column and repeated the experiment. Solid lines represent the results after adding the index.

ArtPoon commented 10 months ago

It's surprising to me that there is not much difference in computing time between the empty and full database scenarios. Are you sure that minimap2 is being run when a record is not found in the database?

SandeepThokala commented 10 months ago

Just got to test on full database with 1m sequences.

ArtPoon commented 9 months ago

@SandeepThokala can you please add timings for running this part of the pipeline WITHOUT the database (aligning everything with minimap2)?

ArtPoon commented 9 months ago

We also need to figure out why running with an empty database is giving us new alignments "for free" (at no additional cost to computing time). This does not make sense to me! the other possibility is that retrieving the alignment results ("feature vectors") from the database for each record is taking so much time that it is roughly equivalent in cost to re-aligning each sequence de novo!

GopiGugan commented 9 months ago

@GopiGugan to repeat experiment with MongoDB (NOSQL) instead of SQL to see if it is faster

Captured the time taken to run the following function: https://github.com/PoonLab/covizu/blob/0254ac493e500d93f57dbb7343941c4dd5bef37c/batch.py#L106-L116

MongoDB results:

Empty Database: Number of Sequences	time(min)
2000	0.0774
10,000	0.3622
100,000	3.7555
1,000,000	33.2657

Populated Database: Number of Sequences	time(min)
2000	0.0282
10,000	0.1390
100,000	1.9545
1,000,000	19.0912

MongoDB timing

ArtPoon commented 9 months ago

I think we might need to be more aggressive about reducing data processing and filter the provisioned JSON stream by record submission date (i.e., cutoff by last CoVizu run date). This bypasses database tranasactions, and it also gives us a way of tracking which lineages have been updated (see #493). The downside of this approach is that any records that have been removed from the database would still be in our data set.

ArtPoon commented 9 months ago

@SandeepThokala can you give this a try with your test data please?

ArtPoon commented 9 months ago

There is a possibility that new records in the database have submission dates in the past, to the degree that they are earlier than our last processing date. We could handle this by extending our date cutoff by a week or more. We could also check records in this extended range for accession numbers.

SandeepThokala commented 9 months ago

Using dataset with 2000 records (dev.2000.json.xz)

I selected a date (2022-01-20) so that there are approximately 600 sequences out of the 2000 which have older submission date. Timing results on empty database

(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.2000.json.xz --mindate 2022-01-20 --dry-run
🏄 [0:00:00.878787] Processing GISAID feed data
🏄 [0:00:02.926217] aligned 0 records
🏄 [0:00:03.350381] filtered 612 problematic features
🏄 [0:00:03.350517]          493 genomes with excess missing sites
🏄 [0:00:03.350567]          142 genomes with excess divergence
Function sort_by_lineage took 2.4718 seconds to execute
🏄 [0:00:03.350973] Parsing Pango lineage designations

populated database

(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.2000.json.xz --mindate 2022-01-20 --dry-run
🏄 [0:00:00.761366] Processing GISAID feed data
🏄 [0:00:01.507638] aligned 0 records
🏄 [0:00:01.682629] filtered 612 problematic features
🏄 [0:00:01.682715]          493 genomes with excess missing sites
🏄 [0:00:01.682726]          142 genomes with excess divergence
Function sort_by_lineage took 0.9213 seconds to execute
🏄 [0:00:01.683012] Parsing Pango lineage designations

SandeepThokala commented 9 months ago

Using dataset with 10,000 records (dev.10000.json.xz)

I selected a date (2022-01-29) so that there are approximately 3,000 sequences out of the 10,000 which have older submission date. Timing results on empty database

(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.10000.json.xz --mindate 2022-01-29 --dry-run
🏄 [0:00:00.923995] Processing GISAID feed data
🏄 [0:00:04.266356] aligned 0 records
🏄 [0:00:12.119832] filtered 3432 problematic features
🏄 [0:00:12.119925]          3820 genomes with excess missing sites
🏄 [0:00:12.119941]          356 genomes with excess divergence
Function sort_by_lineage took 11.1960 seconds to execute
🏄 [0:00:12.120137] Parsing Pango lineage designations

populated database

(covizu) [sandeep@Paphlagon covizu]$ python3 batch.py --infile dev.10000.json.xz --mindate 2022-01-29 --dry-run
🏄 [0:00:01.085012] Processing GISAID feed data
🏄 [0:00:02.514592] aligned 0 records
🏄 [0:00:05.223317] filtered 3432 problematic features
🏄 [0:00:05.223389]          3820 genomes with excess missing sites
🏄 [0:00:05.223402]          356 genomes with excess divergence
Function sort_by_lineage took 4.1384 seconds to execute
🏄 [0:00:05.223610] Parsing Pango lineage designations

ArtPoon commented 9 months ago

@SandeepThokala can you please make sure that the outputs (by_lineage dict) are identical between runs? Can you compare to the nosql database, and get some more detailed timing? What happens when none of the sequences are new?

SandeepThokala commented 9 months ago

Facing the following error at line 160 when providing now new sequences by specifying a recent mindate.

(covizu) [sandeep@Paphlagon covizu]$ nohup python3 batch.py --infile dev.2000.json.xz --mindate 2023-11-26 
Traceback (most recent call last):
  File "/home/sandeep/Documents/covizu/batch.py", line 158, in <module>
    timetree, residuals = build_timetree(by_lineage, args, cb.callback)
  File "/home/sandeep/Documents/covizu/covizu/utils/batch_utils.py", line 87, in build_timetree
    return covizu.treetime.parse_nexus(nexus_file, fasta)
  File "/home/sandeep/Documents/covizu/covizu/treetime.py", line 160, in parse_nexus
    rmean = statistics.mean(rvals)
  File "/home/sandeep/miniconda3/envs/covizu/lib/python3.10/statistics.py", line 328, in mean
    raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point

https://github.com/PoonLab/covizu/blob/4ab1ba45b919a45d974c0dc58bf64f09c7dfadb1/covizu/treetime.py#L158-L163

SandeepThokala commented 9 months ago

Nosql vs Postgresql timing results covizu-plot (1)

ArtPoon commented 9 months ago

try serializing the dictionary as JSON with a non-zero indent argument to induce line breaks, if you are going to export these as text files for inspection
it might be easier to have both versions of the by_lineage dictionary in memory in the same instance of Python, compare the key sets and then step through the associated values to identify differences

SandeepThokala commented 8 months ago

When a recent mindate is specified:

by_lineage dict is empty.
The code never executes line 145, resulting in empty residuals.
Resulted in statistics.StatisticsError when trying to calculate mean of empty list. https://github.com/PoonLab/covizu/blob/4ab1ba45b919a45d974c0dc58bf64f09c7dfadb1/covizu/treetime.py#L137-L145

When a different outdir is specified along with recent mindate:

timetree function returned None and stored it in nexus_file
Resulted in TypeError when trying to open nexus_file https://github.com/PoonLab/covizu/blob/4ab1ba45b919a45d974c0dc58bf64f09c7dfadb1/covizu/utils/batch_utils.py#L83-L84

SandeepThokala commented 8 months ago

try serializing the dictionary as JSON with a non-zero indent argument to induce line breaks, if you are going to export these as text files for inspection

it might be easier to have both versions of the by_lineage dictionary in memory in the same instance of Python, compare the key sets and then step through the associated values to identify differences

During the initial run, the 'missing' data is stored using lists, whereas on the subsequent run, they are stored using tuples.

SandeepThokala commented 8 months ago

Timing results postgresql vs nosql	Database	Scenario	time (s)	time (s)
Postgres	Empty	4.84	19.63	787
Postgres	Populated	2.1	15.7	1980
NoSQL	Empty	4.9	22.8	214
NoSQL	Populated	2.8	7.8	89

covizu-plot (2)

ArtPoon commented 8 months ago

it's still confusing why the empty nosql database runs are always slower, it would be nice to determine what the limiting step is
can you please add timings for the current version (i.e., no database) to the above plot?

SandeepThokala commented 8 months ago

The previous results were inaccurate due to my lack of experience with git stashing. While switching between the iss485 and iss485-nosql branches, I accidentally lost the changes where I had added an index on the 'qname' column. As a result, the runs on populated database costed more time.

These are the final results.

Database	Scenario	2k	10k	100k	1m
Postgres	Empty	4.35	16.34	155.35	1344.49
Postgres	Populated	1.99	6.05	41.27	338.7
NoSQL	Empty	6.4	24.91	240.41	2074.86
NoSQL	Populated	3.1	8.79	88.72	843.27

covizu-plot (3)

PoonLab / covizu

Restore local database for extracted features #485

Using dataset with 2000 records (dev.2000.json.xz)

Using dataset with 10,000 records (dev.10000.json.xz)