churchill-lab / g2gtools

Personal diploid genome creation and coordinate conversion
http://churchill-lab.github.io/g2gtools

New Feature - to overcome slow gtf2db #12

Closed: inti closed 5 years ago

inti commented 6 years ago

Hi, I have found that the g2gtools gtf2db step can be the slowest part of generating a diploid transcriptome. It seems to me that the process could be made faster by dividing the input GTF file. I wonder whether it is possible to add a feature for merging different databases. That way one could run g2gtools gtf2db on subsets of the GTF file and then consolidate the results into a single db at the end.
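The split-and-merge idea above could be sketched roughly as follows. This is only an illustration, not a g2gtools feature: `split_gtf_by_seqname` and `merge_sqlite_dbs` are hypothetical helpers, and the merge step assumes the gtf2db output is a SQLite file whose tables have the same schema in every part and no colliding keys, which has not been checked against g2gtools.

```python
import sqlite3
from pathlib import Path


def split_gtf_by_seqname(gtf_path, out_dir):
    """Write one GTF file per sequence name (column 1) and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}
    try:
        with open(gtf_path) as gtf:
            for line in gtf:
                if line.startswith("#") or not line.strip():
                    continue  # skip header comments and blank lines
                seqname = line.split("\t", 1)[0]
                if seqname not in handles:
                    handles[seqname] = open(out_dir / f"{seqname}.gtf", "w")
                handles[seqname].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(out_dir / f"{name}.gtf" for name in handles)


def merge_sqlite_dbs(db_paths, merged_path):
    """Copy db_paths[0] to merged_path, then append the rows of the other files.

    ASSUMPTION: every file is a SQLite database with identical table
    definitions and no overlapping primary keys.
    """
    merged_path = Path(merged_path)
    merged_path.write_bytes(Path(db_paths[0]).read_bytes())
    con = sqlite3.connect(merged_path)
    try:
        tables = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
        for path in db_paths[1:]:
            con.execute("ATTACH DATABASE ? AS part", (str(path),))
            for table in tables:
                con.execute(
                    f'INSERT INTO main."{table}" SELECT * FROM part."{table}"')
            con.commit()  # DETACH is not allowed inside an open transaction
            con.execute("DETACH DATABASE part")
    finally:
        con.close()
    return merged_path
```

Each per-chromosome GTF could then be passed to a separate g2gtools gtf2db run. Note that the splitter keeps one file handle open per sequence name, so an annotation with thousands of scaffolds may hit the open-file limit, and any indexes would only be valid after the merge if the parts really share the same schema.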

Thanks in advance,

inti commented 6 years ago

I ended up using gffread, http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread_ex, but it would be nicer and more consistent to have a fast g2gtools gtf2db step.

kbchoi-jax commented 6 years ago

Hi Inti,

Thanks, we know gtf2db is the slowest step. When I ran the whole pipeline for a 1000 Genomes sample, each step took (on a 32-core AMD Opteron Processor 6136 Linux box with 512GB RAM):

vcf2vci: 10m32s
patch: 2m5s
transform: 5m38s
gtf2db: 32m24s
extract --genes: 2m31s
extract --transcripts: 7m31s
extract --exons: 1m24s
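To put those numbers in proportion, here is a small sketch that adds up the reported step times and computes the share taken by gtf2db. The timings are copied from the comment above; the helper name `to_seconds` is just for illustration.

```python
import re

# Step timings as reported in the comment above.
timings = {
    "vcf2vci": "10m32s",
    "patch": "2m5s",
    "transform": "5m38s",
    "gtf2db": "32m24s",
    "extract --genes": "2m31s",
    "extract --transcripts": "7m31s",
    "extract --exons": "1m24s",
}


def to_seconds(text):
    """Convert a string such as '10m32s' to a number of seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m(\d+)s", text).groups()
    return int(minutes) * 60 + int(seconds)


total = sum(to_seconds(t) for t in timings.values())
share = to_seconds(timings["gtf2db"]) / total

print(f"total: {total // 60}m{total % 60}s")  # total: 62m5s
print(f"gtf2db share: {share:.1%}")           # gtf2db share: 52.2%
```

So in that run gtf2db accounts for roughly half of the total wall-clock time.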

But my goal was not simply extracting genes or transcripts but creating an annotation database for more general use later. I will take a look but I cannot guarantee we'll work on it any time soon. Thanks for using g2gtools.

kbchoi-jax commented 6 years ago

Please keep me updated on this. Also try adding -d or -dd to turn on verbose mode and check the running status. Thanks Inti!

inti commented 6 years ago

Same file, a different run:

[g2gtools] Processed 625,000 records
[g2gtools] Processed 626,000 records
[g2gtools] Processed 627,000 records
[g2gtools] Processed 628,000 records
[g2gtools] GTF File parsed
[g2gtools] Finalizing database...
[g2gtools] Database created
[g2gtools] Execution complete: 04:12:35.32

Do you know if there is a script in EMASE that does the same thing faster? Perhaps prepare-emase -G ${REF_FASTA} -g ${REF_GTF} -o ${REF_DIR} -m would do the same?

Thanks in advance

kbchoi-jax commented 6 years ago

Hmm, over 4 hours is a bit too long for processing <1M entries. Could it be a system issue? I am not sure how efficient prepare-emase is compared to g2gtools gtf2db, but you may still want to give it a try.
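For reference, the throughput implied by the log above can be estimated with a few lines. This sketch assumes the last progress line (628,000 records) is close to the total record count, which the log does not state exactly.

```python
def elapsed_seconds(stamp):
    """Convert an 'HH:MM:SS.ss' execution time to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Values taken from the log: last progress line and "Execution complete".
records = 628_000  # ASSUMPTION: approximately the total number of records
elapsed = elapsed_seconds("04:12:35.32")
rate = records / elapsed

print(f"{elapsed:.2f} s, {rate:.1f} records/s")  # 15155.32 s, 41.4 records/s
```

About 41 records per second is very low for simple line parsing, which fits the suspicion that something in the environment, rather than the input size, is responsible.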