Closed: inti closed this issue 5 years ago
I ended up using gffread (http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread_ex), but it would be nicer and more consistent to have a fast g2gtools gtf2db step.
Hi Inti,
Thanks. We know gtf2db is the slowest step. When I ran the whole pipeline for a 1000 Genomes sample on a 32-core AMD Opteron Processor 6136 Linux box with 512 GB RAM, each step took:
vcf2vci: 10m32s
patch: 2m5s
transform: 5m38s
gtf2db: 32m24s
extract --genes: 2m31s
extract --transcripts: 7m31s
extract --exons: 1m24s
But my goal was not simply to extract genes or transcripts but to create an annotation database for more general use later. I will take a look, but I cannot guarantee we'll work on it any time soon. Thanks for using g2gtools.
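To put the timings above in proportion, here is a minimal Python sketch that converts them to seconds and prints each step's share of the total. The dictionary and the helper name `to_seconds` are only illustrative; the numbers are copied from the list above.

```python
import re

# Wall-clock timings copied from the list above (one 1000 Genomes sample).
STEP_TIMES = {
    "vcf2vci": "10m32s",
    "patch": "2m5s",
    "transform": "5m38s",
    "gtf2db": "32m24s",
    "extract --genes": "2m31s",
    "extract --transcripts": "7m31s",
    "extract --exons": "1m24s",
}


def to_seconds(text):
    """Convert a string such as '10m32s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unexpected time format: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


seconds_by_step = {step: to_seconds(t) for step, t in STEP_TIMES.items()}
total = sum(seconds_by_step.values())

# Print the steps from slowest to fastest with their share of the total.
for step, secs in sorted(seconds_by_step.items(), key=lambda kv: -kv[1]):
    print(f"{step:<22} {secs:>5d}s  {100 * secs / total:5.1f}%")
print(f"{'total':<22} {total:>5d}s")
```

With these numbers the whole pipeline takes 3725 s (62m05s), and gtf2db alone accounts for roughly half of it.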
Please keep me updated on this. Also try adding -d or -dd to toggle on verbose mode and check the running status. Thanks Inti!
Same file, a different run:
[g2gtools] Processed 625,000 records
[g2gtools] Processed 626,000 records
[g2gtools] Processed 627,000 records
[g2gtools] Processed 628,000 records
[g2gtools] GTF File parsed
[g2gtools] Finalizing database...
[g2gtools] Database created
[g2gtools] Execution complete: 04:12:35.32
Do you know if there is a script in EMASE that will do the same faster? Perhaps prepare-emase -G ${REF_FASTA} -g ${REF_GTF} -o ${REF_DIR} -m would do the same?
Thanks in advance
Hmm, over 4 hours is a bit too long for processing <1M entries. Some system issue? I am not sure how efficient prepare-emase is compared to g2gtools gtf2db, but you may still want to give it a try.
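For a rough sense of scale, the log above can be turned into a throughput figure. This is only a back-of-the-envelope sketch in Python. It assumes the last progress line (628,000 records) is close to the total record count, and it counts the finalizing phase as part of the elapsed time.

```python
def elapsed_seconds(stamp):
    """Convert an 'HH:MM:SS.ss' string to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Values read from the log above; the record count is the last progress
# line, so the true total may be slightly higher (assumption).
last_reported_records = 628_000
elapsed = elapsed_seconds("04:12:35.32")

rate = last_reported_records / elapsed
print(f"{elapsed:.2f} s elapsed, about {rate:.1f} records/s")
```

That works out to a little over 40 records per second for that run.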
Hi, I have found that the g2gtools gtf2db step can be the slowest part of generating a diploid transcriptome. It seems to me the process could be made faster by dividing the input GTF file. I wonder whether it is possible to add a feature for merging different databases. That way one could run g2gtools gtf2db on subsets of the GTF file and then consolidate them into a single db at the end. Thanks in advance,
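The splitting half of this idea can be sketched in a few lines of Python. This is only an illustration, not part of g2gtools: the function names `split_gtf_by_seqname` and `write_chunks` are hypothetical, and the merge step is left out because it is the feature being requested here. Splitting on the first GTF column (the chromosome/seqname) keeps all features of a gene in the same chunk.

```python
from collections import defaultdict


def split_gtf_by_seqname(lines):
    """Group GTF lines by their first column (seqname).

    Returns (header, groups): the '#' comment lines, and a dict that maps
    each seqname to its record lines in their original order.
    """
    header = []
    groups = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        if not line.strip():
            continue  # skip blank lines
        seqname = line.split("\t", 1)[0]
        groups[seqname].append(line)
    return header, dict(groups)


def write_chunks(gtf_path, out_prefix):
    """Write one GTF file per seqname and return the paths written.

    The file naming scheme is an assumption made for this example. The whole
    file is held in memory, which is fine for roughly a million records.
    """
    with open(gtf_path) as handle:
        header, groups = split_gtf_by_seqname(handle)
    paths = []
    for seqname, records in groups.items():
        path = f"{out_prefix}.{seqname}.gtf"
        with open(path, "w") as out:
            out.writelines(header)
            out.writelines(records)
        paths.append(path)
    return paths
```

Each chunk could then be passed to a separate g2gtools gtf2db run, but combining the resulting databases would still need the merge feature asked for above.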