genome-nexus / genome-nexus-annotation-pipeline

Library and tool for annotating MAF files using Genome Nexus Webserver API
MIT License

Annotate big mutation file w/o failure #216

Open inodb opened 1 year ago

inodb commented 1 year ago

We can use this file to test:

https://github.com/cBioPortal/datahub/blob/master/public/difg_glass_2019/data_mutations.txt

rmadupuri commented 1 year ago

Another test file if needed:

https://github.com/cBioPortal/datahub/blob/10a57682c24513b9eec3943378169c9b1bc6f3e6/public/pog570_bcgsc_2020/data_mutations.txt

ozguzMete commented 1 year ago

I used a 20G heap (and would probably have needed a lot more) just to load a ~1.5G file (pog570_bcgsc_2020/data_mutations.txt). We shouldn't touch a single character of a line until we actually need it, because reading that file with a buffered reader already takes 3-5 seconds on its own.
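A minimal sketch of that lazy approach in plain Java; `LazyMafReader` and `forEachDataLine` are hypothetical names, not the pipeline's real API. The point is to hand each raw line onward and defer any splitting or trimming until a consumer actually needs it:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical sketch: stream a large MAF file line by line instead of
// materializing it, and leave per-character work (splitting into columns)
// to the consumer, which can skip it for lines it never uses.
public class LazyMafReader {
    public static void forEachDataLine(Path maf, Consumer<String> handler) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(maf)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blanks and comment lines cheaply
                }
                handler.accept(line); // raw line; no split or copy done here
            }
        }
    }
}
```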

ozguzMete commented 1 year ago

The runtime problem is solved by #227. The memory problem could be solved by not keeping a giant Map<String, VariantAnnotation> gnResponseVariantKeyMap.

ozguzMete commented 1 year ago

@inodb @rmadupuri @sheridancbio

For some reason we build a giant Map<String, VariantAnnotation> gnResponseVariantKeyMap. Do we really need it? I have my doubts...

Let me show you what's going on, step by step (sketched in code after the list):

  1. we sort and partition our query data
  2. for each partition we send a POST request
  3. each POST request returns a list of OriginalVariantQuery results, one produced per item in the partition
  4. we store the items of this list in gnResponseVariantKeyMap
  5. we repeat steps 2-4 until there are no partitions left
  6. we use this map to convert each mutation record into an annotated record
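Here is a rough sketch of how I read the current flow. VariantAnnotation and GenomicLocation are the real Genome Nexus client types and gnResponseVariantKeyMap is the real map; `partitions`, `records`, `postAnnotations`, `convert`, and `writer` are placeholders standing in for the pipeline's actual plumbing:

```java
// Hypothetical sketch of the current flow (placeholder names, see above).
Map<String, VariantAnnotation> gnResponseVariantKeyMap = new HashMap<>();
for (List<GenomicLocation> partition : partitions) {
    // steps 2-4: one POST per partition, results parked in the big map
    for (VariantAnnotation annotation : postAnnotations(partition)) {
        // a later insert for the same key silently overwrites the earlier one
        gnResponseVariantKeyMap.put(annotation.getOriginalVariantQuery(), annotation);
    }
}
// step 6: every partition's response is still reachable at this point
for (MutationRecord record : records) {
    writer.write(convert(record, gnResponseVariantKeyMap));
}
```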

These steps would only make sense if there were fewer OriginalVariantQuery results than genomicLocations, and if, for some reason, we wanted to keep only the last inserted OriginalVariantQuery for each key.

That sounds unnecessary to me. These steps could be changed to the following (a code sketch follows the list):

  1. we sort and partition our query data
  2. for each partition we send a POST request
  3. each POST request returns a list of OriginalVariantQuery results, one produced per item in the partition
  4. we store the items of this list in a new, per-partition data structure
  5. we use this data structure to convert each of the partition's mutation records into annotated records
  6. we repeat steps 2-5
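The same sketch reshaped to the proposed flow, with the same placeholder names (`recordsFor` is also hypothetical); the only real change is that the map lives for one partition instead of for the whole file:

```java
// Hypothetical sketch of the proposed flow (placeholder names as before).
for (List<GenomicLocation> partition : partitions) {
    // steps 2-4: fetch one partition into a short-lived map
    Map<String, VariantAnnotation> partitionMap = new HashMap<>();
    for (VariantAnnotation annotation : postAnnotations(partition)) {
        partitionMap.put(annotation.getOriginalVariantQuery(), annotation);
    }
    // step 5: convert and write this partition's records immediately
    for (MutationRecord record : recordsFor(partition)) {
        writer.write(convert(record, partitionMap));
    }
    // step 6: loop; partitionMap is now unreachable and collectible
}
```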

With this flow the garbage collector can reclaim each POST response as soon as its partition has been processed. We could also introduce multithreading, although without solving the memory issue first that would be "meh".
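A sketch of the multithreading part, assuming the per-partition flow above, a thread-safe writer, and an enclosing method that declares throws Exception for Future.get(); `annotateAndWrite` is a placeholder that fetches, converts, and writes one partition:

```java
// A small fixed pool bounds how many partition responses are in memory
// at the same time, so parallelism does not undo the memory fix.
ExecutorService pool = Executors.newFixedThreadPool(4);
List<Future<?>> futures = new ArrayList<>();
for (List<GenomicLocation> partition : partitions) {
    futures.add(pool.submit(() -> annotateAndWrite(partition)));
}
for (Future<?> future : futures) {
    future.get(); // surfaces worker failures
}
pool.shutdown();
```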

If these steps can't be changed, using a smaller version of VariantAnnotation might help.
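If so, a slim value class might look like this; the field choice is purely illustrative, since I haven't checked which VariantAnnotation fields the converter actually reads:

```java
// Hypothetical fallback: keep the single big map, but store a slim value
// holding only the fields the converter needs, instead of the whole
// VariantAnnotation tree. Fields below are illustrative, not the real set.
final class SlimAnnotation {
    final String originalVariantQuery;
    final String hgvsg;
    final String mostSevereConsequence;

    SlimAnnotation(String originalVariantQuery, String hgvsg, String mostSevereConsequence) {
        this.originalVariantQuery = originalVariantQuery;
        this.hgvsg = hgvsg;
        this.mostSevereConsequence = mostSevereConsequence;
    }
}
```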