NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit
GNU General Public License v3.0

Converting .gff to .gtf for CellRanger #434

Closed matejasoretic closed 3 months ago

matejasoretic commented 4 months ago

Describe the bug: Hello, I am having trouble getting agat_convert_sp_gff2gtf.pl to convert .gff files from NCBI into .gtf files that are compatible with CellRanger. Specifically, I am trying to use it on the RefSeq .gff file found here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002197715.1/

When I run agat_convert_sp_gff2gtf.pl -gff genomic.gff -o genomic.gtf --gtf_version 2.5 I get an output .gtf file, but running cellranger mkgtf on it fails:

cellranger mkgtf genomic.gtf genomic.filtered.gtf --attribute=gene_biotype:protein_coding

cellranger.reference.GtfParseError: Error while parsing GTF file path/to/genomic.gtf
Error parsing GTF at line 13.  Parsed attribute had a quote in the middle of a value.  Please ensure quotes are only used to encapsulate attribute values.
 Bad Attribute Value = ID 

What command or preprocessing of the .gff do I need in order to eventually get a .gtf compatible with CellRanger?


Juke34 commented 4 months ago

Can you show the first few lines of the genomic.gtf file, so that I can see the problematic line 13?

matejasoretic commented 4 months ago
head -n 15 genomic.gtf
##gtf-version 2.5
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build LonStrDom1
#!genome-build-accession NCBI_Assembly:GCF_002197715.1
#!annotation-source NCBI Lonchura striata domestica Annotation Release 100
##sequence-region NW_018654727.1 1 15662897
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=299123
NW_018655405.1  Gnomon  gene    6165    22916   .   +   .   gene_id "gene3703"; Dbxref "GeneID:110469432"; ID "gene3703"; Name "LOC110469432"; gbkey "Gene"; gene "LOC110469432"; gene_biotype "protein_coding";
NW_018655405.1  Gnomon  transcript  6165    22916   .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "rna7531"; Name "XM_021528166.1"; Parent "gene3703"; gbkey "mRNA"; gene "LOC110469432"; model_evidence "Supporting evidence includes similarity to: 1 EST, 1 Protein, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; original_biotype "mrna"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    6165    6244    .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "id103161"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    7681    7715    .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "id103162"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    16242   16389   .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "id103163"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    17719   17823   .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "id103164"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    22272   22916   .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "id103165"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";

I assume the problem is that there are two quoted values in a row, e.g. Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; just before ID

Juke34 commented 4 months ago

OK, CellRanger does not accept attributes with several values written like this: Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; unless there is a , between the values. For the moment you can use a text editor such as VS Code and do a regular-expression replacement: search for "[^;,]? followed by a space, and replace it with ", followed by a space. In other words, add a comma wherever a quote is followed by a space.
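The replacement described above can also be scripted instead of done by hand in an editor. Below is a minimal Python sketch, assuming the goal is exactly the form requested in this thread (a comma and a space between consecutive quoted values). The function name `join_multi_values` is hypothetical, and the pattern is deliberately narrower than the editor regex: it only matches a closing quote that is followed by spaces and then another opening quote. Whether CellRanger accepts the result is taken from this thread and has not been verified independently.

```python
import re

# Closing quote, one or more spaces, opening quote: the gap between two
# values that belong to the same attribute key.
_ADJACENT_VALUES = re.compile(r'" +"')


def join_multi_values(line: str) -> str:
    """Insert ', ' between consecutive quoted values; leave header lines alone."""
    if line.startswith("#"):
        return line
    return _ADJACENT_VALUES.sub('", "', line)


if __name__ == "__main__":
    # Attribute fragment taken from the output shown earlier in the thread.
    sample = (
        'transcript_id "XM_021528166.1"; '
        'Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "rna7531";'
    )
    print(join_multi_values(sample))
```

To process a whole file, one option is to read genomic.gtf line by line, pass each line through the function, and write the results to a new file, so that the original conversion output stays untouched.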

matejasoretic commented 4 months ago

I altered the gtf, the first lines are:

#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build LonStrDom1
#!genome-build-accession NCBI_Assembly:GCF_002197715.1
#!annotation-source NCBI Lonchura striata domestica Annotation Release 100
##sequence-region NW_018654727.1 1 15662897
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=299123
NW_018655405.1  Gnomon  gene    6165    22916   .   +   .   gene_id "gene3703"; Dbxref "GeneID:110469432"; ID "gene3703"; Name "LOC110469432"; gbkey "Gene"; gene "LOC110469432"; gene_biotype "protein_coding";
NW_018655405.1  Gnomon  transcript  6165    22916   .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432";"Genbank:XM_021528166.1"; ID "rna7531"; Name "XM_021528166.1"; Parent "gene3703"; gbkey "mRNA"; gene "LOC110469432"; model_evidence "Supporting evidence includes similarity to: 1 EST, 1 Protein, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; original_biotype "mrna"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    6165    6244    .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432";"Genbank:XM_021528166.1"; ID "id103161"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    7681    7715    .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432";"Genbank:XM_021528166.1"; ID "id103162"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    16242   16389   .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432";"Genbank:XM_021528166.1"; ID "id103163"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    17719   17823   .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432";"Genbank:XM_021528166.1"; ID "id103164"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    22272   22916   .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432";"Genbank:XM_021528166.1"; ID "id103165"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";

I tried this, and it is not enough to fix the error when running cellranger mkgtf; I still get the same error. I don't think it is sufficient for all the quoted values to be separated by ;. I believe that, in the case of e.g. Dbxref "GeneID:110469432";"Genbank:XM_021528166.1", this should be transformed either to: Dbxref "GeneID:110469432"; Genbank "XM_021528166.1" or to something like: Dbxref "GeneID:110469432"; Dbxref2 "Genbank:XM_021528166.1"
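The second form suggested in this comment (numbered keys such as Dbxref2) can be sketched in Python as follows. This only illustrates the poster's idea; it is not something AGAT produces, and it is not the fix the maintainer recommends later in the thread. The function name `split_multi_values` and the numbering scheme are assumptions, and the sketch does not check whether a numbered key such as Dbxref2 already exists on the line.

```python
import re

# A key followed by two or more space-separated quoted values.
_MULTI = re.compile(r'(\w+) ("[^"]*"(?: "[^"]*")+)')
_VALUE = re.compile(r'"[^"]*"')


def split_multi_values(attributes: str) -> str:
    """Rewrite key "a" "b" as key "a"; key2 "b" (hypothetical naming scheme)."""

    def expand(match: re.Match) -> str:
        key = match.group(1)
        values = _VALUE.findall(match.group(2))
        parts = []
        for index, value in enumerate(values, start=1):
            # The first value keeps the original key; later ones get a suffix.
            name = key if index == 1 else f"{key}{index}"
            parts.append(f"{name} {value}")
        return "; ".join(parts)

    return _MULTI.sub(expand, attributes)


if __name__ == "__main__":
    print(split_multi_values(
        'Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "id103161";'
    ))
```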

matejasoretic commented 4 months ago

Moreover, if I change the format to account for this by replacing each instance of '"Genbank:' with 'Genbank "', this error goes away, but a new error appears at line 4624. This line, as printed by R, is: "NW_018656673.1\tGnomon\tgene\t21\t551\t.\t-\t.\tgene_id \"gene7633\"; Dbxref \"GeneID:110473781\"; ID \"gene7633\"; Name \"LOC110473781\"; end_range \"551\";\".\"; gbkey \"Gene\"; gene \"LOC110473781\"; gene_biotype \"protein_coding\"; partial \"true\"; start_range \".\";\"21\";"

In this case, the issue is the ;"." that occurs before gbkey. Once I repaired this error, there was another, unrelated error at line 9876. I don't think going through the whole gtf file and making substitutions until every issue is repaired is the best approach.
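Rather than discovering such lines one error at a time, the file can be scanned once for every attribute key that carries more than one quoted value. The following Python sketch assumes the untouched AGAT output, where multiple values are presumably separated by a single space (as in the Dbxref example above); the function name `count_multi_value_keys` is hypothetical.

```python
import re
from collections import Counter

# A key followed by two or more space-separated quoted values, e.g.
# Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"
_MULTI = re.compile(r'(\w+) "[^"]*"(?: "[^"]*")+')


def count_multi_value_keys(lines):
    """Count, per attribute key, how many features give it several values."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and directive lines
        # The attributes are the last tab-separated column of a GTF line.
        attributes = line.rstrip("\n").split("\t")[-1]
        for match in _MULTI.finditer(attributes):
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    demo = [
        "##gtf-version 2.5\n",
        'NW_1\tGnomon\texon\t1\t9\t.\t+\t.\tgene_id "g1"; Dbxref "GeneID:1" "Genbank:XM_1"; ID "id1";\n',
    ]
    for key, n in count_multi_value_keys(demo).items():
        print(key, n)
```

Running this over the real file (for example by passing an open file handle as `lines`) lists every affected key up front, which makes it easier to judge whether a single replacement rule covers them all.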

Juke34 commented 4 months ago

You misused the regex replacement.

Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; must become: Dbxref "GeneID:110469432", "Genbank:XM_021528166.1";

What you get is wrong: Dbxref "GeneID:110469432";"Genbank:XM_021528166.1";

Juke34 commented 4 months ago

[Screenshot 2024-03-04 at 11 22 46] Do not forget to enable regex replacement (the ".*" option) and to include the space at the end of each line.

matejasoretic commented 4 months ago

What you suggested also did not work. If I do as you suggested, I get:

##gtf-version 2.5
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build LonStrDom1
#!genome-build-accession NCBI_Assembly:GCF_002197715.1
#!annotation-source NCBI Lonchura striata domestica Annotation Release 100
##sequence-region NW_018654727.1 1 15662897
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=299123
NW_018655405.1  Gnomon  gene    6165    22916   .   +   .   gene_id "gene3703"; Dbxref "GeneID:110469432"; ID "gene3703"; Name "LOC110469432"; gbkey "Gene"; gene "LOC110469432"; gene_biotype "protein_coding";
NW_018655405.1  Gnomon  transcript  6165    22916   .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432","Genbank:XM_021528166.1"; ID "rna7531"; Name "XM_021528166.1"; Parent "gene3703"; gbkey "mRNA"; gene "LOC110469432"; model_evidence "Supporting evidence includes similarity to: 1 EST, 1 Protein, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; original_biotype "mrna"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    6165    6244    .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432","Genbank:XM_021528166.1"; ID "id103161"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    7681    7715    .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432","Genbank:XM_021528166.1"; ID "id103162"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    16242   16389   .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432","Genbank:XM_021528166.1"; ID "id103163"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    17719   17823   .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432","Genbank:XM_021528166.1"; ID "id103164"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1  Gnomon  exon    22272   22916   .   +   .   gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432","Genbank:XM_021528166.1"; ID "id103165"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";

I again get the error: Error parsing GTF at line 10. Parsed attribute had a quote in the middle of a value. Please ensure quotes are only used to encapsulate attribute values. Bad Attribute Value = ID. This error disappeared when I replaced '"Genbank:' with 'Genbank "'.

Juke34 commented 4 months ago

Still not OK. Be careful with the space at the end of each line.

you should get:
Dbxref "GeneID:110469432", "Genbank:XM_021528166.1"; and not:
Dbxref "GeneID:110469432","Genbank:XM_021528166.1";

The space is important. The pattern "[^;,]? followed by a space is different from "[^;,]? without one, and the replacement ", followed by a space is different from ", without one.

matejasoretic commented 3 months ago

Oh, I see now. Thank you. I am closing the thread.