genome-nexus / genome-nexus-annotation-pipeline

Library and tool for annotating MAF files using Genome Nexus Webserver API
MIT License

Annotate big mutation file w/o failure #216

Open inodb opened 1 year ago

inodb commented 1 year ago

We can use this file to test:

https://github.com/cBioPortal/datahub/blob/master/public/difg_glass_2019/data_mutations.txt

rmadupuri commented 1 year ago

Another test file if needed:

https://github.com/cBioPortal/datahub/blob/10a57682c24513b9eec3943378169c9b1bc6f3e6/public/pog570_bcgsc_2020/data_mutations.txt

ozguzMete commented 1 year ago

I used a 20G heap (and would probably have needed a lot more) just to load a ~1.5G file (pog570_bcgsc_2020/data_mutations.txt). We shouldn't touch a single character of a line until we actually need it, because reading that file with a buffered reader already takes 3-5 seconds on its own.
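A minimal sketch of that lazy approach in plain Java; `LazyMafReader` and `forEachDataLine` are hypothetical names, not the pipeline's real API. The point is to hand each raw line onward and defer any splitting or trimming until a consumer actually needs it:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical sketch: stream a large MAF file line by line instead of
// materializing it, and leave per-character work (splitting into columns)
// to the consumer, which can skip it for lines it never uses.
public class LazyMafReader {
    public static void forEachDataLine(Path maf, Consumer<String> handler) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(maf)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blanks and comment lines cheaply
                }
                handler.accept(line); // raw line; no split or copy done here
            }
        }
    }
}
```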

ozguzMete commented 1 year ago

The runtime problem is solved by #227. The memory problem could be solved by not keeping a giant Map<String, VariantAnnotation> gnResponseVariantKeyMap.

ozguzMete commented 1 year ago

@inodb @rmadupuri @sheridancbio

For some reason we build a giant Map<String, VariantAnnotation> gnResponseVariantKeyMap. Do we really need it? I have my doubts...

Let me show you what's going on, step by step (sketched in code after the list):

  1. we sort and partition our query data
  2. for each partition we send a POST request
  3. each POST request returns a list of OriginalVariantQuery results, one produced per item in the partition
  4. we store the items of this list in gnResponseVariantKeyMap
  5. we repeat steps 2-4 until there are no partitions left
  6. we use this map to convert each mutation record into an annotated record
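Here is a rough sketch of how I read the current flow. VariantAnnotation and GenomicLocation are the real Genome Nexus client types and gnResponseVariantKeyMap is the real map; `partitions`, `records`, `postAnnotations`, `convert`, and `writer` are placeholders standing in for the pipeline's actual plumbing:

```java
// Hypothetical sketch of the current flow (placeholder names, see above).
Map<String, VariantAnnotation> gnResponseVariantKeyMap = new HashMap<>();
for (List<GenomicLocation> partition : partitions) {
    // steps 2-4: one POST per partition, results parked in the big map
    for (VariantAnnotation annotation : postAnnotations(partition)) {
        // a later insert for the same key silently overwrites the earlier one
        gnResponseVariantKeyMap.put(annotation.getOriginalVariantQuery(), annotation);
    }
}
// step 6: every partition's response is still reachable at this point
for (MutationRecord record : records) {
    writer.write(convert(record, gnResponseVariantKeyMap));
}
```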

These steps would only make sense if there were fewer OriginalVariantQuery results than genomicLocations, and if, for some reason, we wanted to keep only the last inserted OriginalVariantQuery for each key.

That sounds unnecessary to me. These steps could be changed to the following (a code sketch follows the list):

  1. we sort and partition our query data
  2. for each partition we send a POST request
  3. each POST request returns a list of OriginalVariantQuery results, one produced per item in the partition
  4. we store the items of this list in a new, per-partition data structure
  5. we use this data structure to convert each of the partition's mutation records into annotated records
  6. we repeat steps 2-5
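The same sketch reshaped to the proposed flow, with the same placeholder names (`recordsFor` is also hypothetical); the only real change is that the map lives for one partition instead of for the whole file:

```java
// Hypothetical sketch of the proposed flow (placeholder names as before).
for (List<GenomicLocation> partition : partitions) {
    // steps 2-4: fetch one partition into a short-lived map
    Map<String, VariantAnnotation> partitionMap = new HashMap<>();
    for (VariantAnnotation annotation : postAnnotations(partition)) {
        partitionMap.put(annotation.getOriginalVariantQuery(), annotation);
    }
    // step 5: convert and write this partition's records immediately
    for (MutationRecord record : recordsFor(partition)) {
        writer.write(convert(record, partitionMap));
    }
    // step 6: loop; partitionMap is now unreachable and collectible
}
```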

With this flow the garbage collector can reclaim each POST response as soon as its partition has been processed. We could also introduce multithreading, although without solving the memory issue first that would be "meh".
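A sketch of the multithreading part, assuming the per-partition flow above, a thread-safe writer, and an enclosing method that declares throws Exception for Future.get(); `annotateAndWrite` is a placeholder that fetches, converts, and writes one partition:

```java
// A small fixed pool bounds how many partition responses are in memory
// at the same time, so parallelism does not undo the memory fix.
ExecutorService pool = Executors.newFixedThreadPool(4);
List<Future<?>> futures = new ArrayList<>();
for (List<GenomicLocation> partition : partitions) {
    futures.add(pool.submit(() -> annotateAndWrite(partition)));
}
for (Future<?> future : futures) {
    future.get(); // surfaces worker failures
}
pool.shutdown();
```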

If these steps can't be changed, using a smaller version of VariantAnnotation might help.
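If so, a slim value class might look like this; the field choice is purely illustrative, since I haven't checked which VariantAnnotation fields the converter actually reads:

```java
// Hypothetical fallback: keep the single big map, but store a slim value
// holding only the fields the converter needs, instead of the whole
// VariantAnnotation tree. Fields below are illustrative, not the real set.
final class SlimAnnotation {
    final String originalVariantQuery;
    final String hgvsg;
    final String mostSevereConsequence;

    SlimAnnotation(String originalVariantQuery, String hgvsg, String mostSevereConsequence) {
        this.originalVariantQuery = originalVariantQuery;
        this.hgvsg = hgvsg;
        this.mostSevereConsequence = mostSevereConsequence;
    }
}
```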