Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow
Creative Commons Attribution Share Alike 4.0 International
147 stars 47 forks source link

ConvertToRefFlat fails while running Drop-seq_tools-2.4.0 #113

Open geshtin opened 3 years ago

geshtin commented 3 years ago

I have problems creating the refflat from the gtf file. It has been reported more often, I followed all suggestion, and as far as I can see the gtfs's look OK. I have one where it works:

1   gramene gene    44289   49837   .   +   .   gene_id "Zm00001d027230"; gene_source "gramene"; gene_biotype "protein_coding"; gene_name "Zm00001d027230";
1   gramene transcript  44289   49837   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene exon    44289   44947   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "1"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon1"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene CDS 44351   44947   .   +   0   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "1"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene start_codon 44351   44353   .   +   0   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "1"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene exon    45666   45803   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "2"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon2"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene CDS 45666   45803   .   +   0   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "2"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene exon    45888   46133   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "3"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon3"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene CDS 45888   46133   .   +   0   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "3"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene exon    46229   46342   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "4"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon4"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene CDS 46229   46342   .   +   0   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "4"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene exon    46451   46633   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "5"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon5"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene CDS 46451   46633   .   +   0   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "5"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene exon    47045   47262   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "6"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon6"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene CDS 47045   47262   .   +   0   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "6"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene exon    47650   48111   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "7"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon7"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene CDS 47650   47992   .   +   1   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "7"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; protein_id "Zm00001d027230_P001"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene stop_codon  47993   47995   .   +   0   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "7"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene exon    48200   49247   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "8"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon8"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene exon    49330   49837   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; exon_number "9"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; exon_id "Zm00001d027230_T001.exon9"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene five_prime_utr  44289   44350   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene three_prime_utr 47996   48111   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene three_prime_utr 48200   49247   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";
1   gramene three_prime_utr 49330   49837   .   +   .   gene_id "Zm00001d027230"; transcript_id "Zm00001d027230_T001"; gene_source "gramene"; gene_biotype "protein_coding"; transcript_source "gramene"; transcript_biotype "protein_coding"; gene_name "Zm00001d027230"; transcript_name "Zm00001d027230_T001";

and one, where it fails:
1   araport11   gene    3631    5899    .   +   .   gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1   araport11   transcript  3631    5899    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1   araport11   exon    3631    3913    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1"; transcript_name "AT1G01010.1";
1   araport11   CDS 3760    3913    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1   araport11   start_codon 3760    3762    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1   araport11   exon    3996    4276    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon2"; transcript_name "AT1G01010.1";
1   araport11   CDS 3996    4276    .   +   2   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1   araport11   exon    4486    4605    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon3"; transcript_name "AT1G01010.1";
1   araport11   CDS 4486    4605    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1   araport11   exon    4706    5095    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon4"; transcript_name "AT1G01010.1";
1   araport11   CDS 4706    5095    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1   araport11   exon    5174    5326    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon5"; transcript_name "AT1G01010.1";
1   araport11   CDS 5174    5326    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1   araport11   exon    5439    5899    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon6"; transcript_name "AT1G01010.1";
1   araport11   CDS 5439    5627    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1"; transcript_name "AT1G01010.1";
1   araport11   stop_codon  5628    5630    .   +   0   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1   araport11   five_prime_utr  3631    3759    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";
1   araport11   three_prime_utr 5631    5899    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; transcript_name "AT1G01010.1";

with the following error:

INFO    2020-10-12 16:13:41     ConvertToRefFlat

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    ConvertToRefFlat -ANNOTATIONS_FILE /data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd3b.Pt.Mt.gtf -SEQUENCE_DICTIONARY /data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/TAIR10_chr_all.ShortId.DS.dict -OUTPUT /data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd1b.refFlat
**********

16:13:42.607 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/annotations/tools/SingleCell/Drop-seq_tools-2.4.0/jar/lib/picard-2.20.5.jar!/com/intel/gkl/native/libgkl_compression.so
16:13:42.618 WARN  NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
16:13:42.628 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/annotations/tools/SingleCell/Drop-seq_tools-2.4.0/jar/lib/picard-2.20.5.jar!/com/intel/gkl/native/libgkl_compression.so
16:13:42.628 WARN  NativeLibraryLoader - Unable to load libgkl_compression.so from native/libgkl_compression.so (No such file or directory)
[Mon Oct 12 16:13:42 CEST 2020] ConvertToRefFlat ANNOTATIONS_FILE=/data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd3b.Pt.Mt.gtf SEQUENCE_DICTIONARY=/data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/TAIR10_chr_all.ShortId.DS.dict OUTPUT=/data/annotations/resources/genomes_rvd/Arabidopsis_thaliana/TAIR10/Arabidopsis_thaliana.TAIR10.48_rvd1b.refFlat    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Mon Oct 12 16:13:42 CEST 2020] Executing as rvd@VM.local on Linux 4.15.0-23-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_171-8u171-b11-0ubuntu0.18.04.1-b11; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.4.0(3d2b3d8_1600201514)
INFO    2020-10-12 16:13:43     GTFParser       Seen many non-increasing record positions. Printing Read-names as well.
[Mon Oct 12 16:13:45 CEST 2020] org.broadinstitute.dropseqrna.annotation.ConvertToRefFlat done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=2194669568
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
        at org.broadinstitute.dropseqrna.annotation.AnnotationUtils.parseOptionalFields(AnnotationUtils.java:386)
        at org.broadinstitute.dropseqrna.annotation.GTFParser.parseLine(GTFParser.java:114)
        at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:88)
        at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39)
        at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
        at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57)
        at org.broadinstitute.dropseqrna.utils.FilteredIterator.next(FilteredIterator.java:92)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
        at org.broadinstitute.dropseqrna.annotation.GeneFromGTFBuilder.gatherByGeneName(GeneFromGTFBuilder.java:205)
        at org.broadinstitute.dropseqrna.annotation.GeneFromGTFBuilder.<init>(GeneFromGTFBuilder.java:48)
        at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:79)
        at org.broadinstitute.dropseqrna.annotation.GTFReader.load(GTFReader.java:74)
        at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadGTFFile(GeneAnnotationReader.java:67)
        at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:51)
        at org.broadinstitute.dropseqrna.annotation.GeneAnnotationReader.loadAnnotationsFile(GeneAnnotationReader.java:61)
        at org.broadinstitute.dropseqrna.annotation.ConvertToRefFlat.doWork(ConvertToRefFlat.java:71)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

Both gf's have the same fields in column 9. I see: "GTFParser Seen many non-increasing record positions" what does this exactly mean? So, what is wrong here? Is the order of the gene_id, transcript_id, gene_name, transcript_name important. I assume the gene lines do not need a transcript id (btw adding this information to the gene lines did not solve my problems). For transcript_name I simply used transcript_id Also the one entry where there is a ";" in the gene name was removed. And finally, there was a mismatch in the fasta and the gtf wrt the scaffold names (for Mt and Pt) this was also corrected.

Any suggestion is greatly appreciated, Raymond

geshtin commented 3 years ago

ls, fyi. I have been digging into it a bit more and came to some interesting observations. I split up my file into smaller parts to see if somehwere there are some "strange entries" Sometimes I came accross non-unique gene names (that is: two different gene_id entries, that have the same gene_name), But this was not a problem with just a few instances (when there were only, let's say, about 50 instance with duplicate names the refflat was created). But the entire file (more than 250 duplicate names) it crashes. Removing all these duplicate names (replacing all the gene_names by the gene_id entry): running fine. I also noticed that "half entries" so starting at exon as first line (and no corresponding initial gene line): no problem. So setting the VALIDATION_STRINGENCY=STRICT to some other value might also work (I did not test this), What is the limit of these type or errors?? Raymond

seb-mueller commented 3 years ago

Thanks for reporting, indeed, dropseq-tools was apparently developed with human/mouse in mind, which seems to be not what you use (Zea maize?). In particular, some of the annotations out there seems to have missing gene_name entries etc which still seems to be unresolved: https://github.com/broadinstitute/Drop-seq/issues/50

I'll have to go through your error log in detail, but in the meantime, maybe this is of help: https://github.com/Hoohm/dropSeqPipe/issues/109

For Arabidopsis, some names have a ; (so annoying, who puts that in a gene name?!), the issue above addresses this as well. Ultimately, a new version is coming out soon, hopefully solving this issue in one way or another!

geshtin commented 3 years ago

thanks for the response, I saw those issue 50 and 109. so I tried various versions of adding gene_name and transcript_name. In the end adding the gene_name to all lines (if not already there) and transcript_name (to all non gene lines) worked out fine. The underlying problem for me was that the gene_name was not unique! A small number of non-unique entries was fine, but a large number (250) not. So i created unique gene_names and then creating the refflat file worked fine. Indeed a problem with the gtf (or gff's) in general is that many people do strange things with them (like the ; in the gene_name) and often not follow the gtf specifications. Raymond btw, the reformatted entry above looks much better now ;)

seb-mueller commented 3 years ago

Indeed, non unique gene names are a big problem and probably to be brought to attention to the annotation databases/consortia! So good spot. Did you write code to make those gene names unique, if yes, you could add it here or in issue #109 or so? Maybe this can be integrated in the pipeline as an option.

geshtin commented 3 years ago

i wrote some python code (so not Java), and it is not that nice code (true python programmers will have a lot of comments!). Not really suitable for a production environment. But it works for me, and only when the first attribute in the 9th column is the gene_id (often the case fortunately). Now I just overwrite ALL gene_names, preferably you want to keep the "nice" gene names, many people use those to get biological info. So this should be, I think handled. But I don't mind adding the roughly 20 lines of code here if you are still interested!