Closed matejasoretic closed 8 months ago
Can you show the first few lines of the genomic.gtf file so that I can see line 13, the problematic one?
head -n 15 genomic.gtf
##gtf-version 2.5
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build LonStrDom1
#!genome-build-accession NCBI_Assembly:GCF_002197715.1
#!annotation-source NCBI Lonchura striata domestica Annotation Release 100
##sequence-region NW_018654727.1 1 15662897
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=299123
NW_018655405.1 Gnomon gene 6165 22916 . + . gene_id "gene3703"; Dbxref "GeneID:110469432"; ID "gene3703"; Name "LOC110469432"; gbkey "Gene"; gene "LOC110469432"; gene_biotype "protein_coding";
NW_018655405.1 Gnomon transcript 6165 22916 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "rna7531"; Name "XM_021528166.1"; Parent "gene3703"; gbkey "mRNA"; gene "LOC110469432"; model_evidence "Supporting evidence includes similarity to: 1 EST, 1 Protein, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; original_biotype "mrna"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 6165 6244 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "id103161"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 7681 7715 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "id103162"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 16242 16389 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "id103163"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 17719 17823 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "id103164"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 22272 22916 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "id103165"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
I assume the problem is that two quoted values appear in a row, e.g. Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; right before ID.
OK, Cellranger does not accept attributes with several values like this: Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; unless there is a , between the values.
For the moment you can use a text editor such as VS Code and do a find-and-replace with a regular expression:
wherever it matches "[^;,]? followed by a space,
replace it with ", followed by a space.
In other words, add a comma wherever a quote is followed by a space.
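The same replacement can be scripted instead of done in an editor. The following is a minimal sketch, assuming the search pattern ends with a single space and the replacement is a quote, a comma, and a space, as described in this thread. The function name add_commas is hypothetical, and nothing here confirms what Cell Ranger ultimately accepts.

```python
import re

# Pattern from the thread: a double quote, an optional character that is
# neither ';' nor ',', then one space. The trailing space is part of the pattern.
MULTI_VALUE_GAP = re.compile(r'"[^;,]? ')


def add_commas(gtf_line: str) -> str:
    """Insert a comma between consecutive quoted values of one attribute."""
    if gtf_line.startswith("#"):
        return gtf_line  # leave header and comment lines untouched
    # Replacement is quote + comma + space, mirroring the editor replacement.
    return MULTI_VALUE_GAP.sub('", ', gtf_line)


before = 'Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "rna7531";'
print(add_commas(before))
```

One caveat with this pattern: a value that starts with a single character followed by a space (for example "a b") would also match and be altered, so it is worth checking the result on a copy of the file first.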
I altered the gtf, the first lines are:
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build LonStrDom1
#!genome-build-accession NCBI_Assembly:GCF_002197715.1
#!annotation-source NCBI Lonchura striata domestica Annotation Release 100
##sequence-region NW_018654727.1 1 15662897
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=299123
NW_018655405.1 Gnomon gene 6165 22916 . + . gene_id "gene3703"; Dbxref "GeneID:110469432"; ID "gene3703"; Name "LOC110469432"; gbkey "Gene"; gene "LOC110469432"; gene_biotype "protein_coding";
NW_018655405.1 Gnomon transcript 6165 22916 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432";"Genbank:XM_021528166.1"; ID "rna7531"; Name "XM_021528166.1"; Parent "gene3703"; gbkey "mRNA"; gene "LOC110469432"; model_evidence "Supporting evidence includes similarity to: 1 EST, 1 Protein, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; original_biotype "mrna"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 6165 6244 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432";"Genbank:XM_021528166.1"; ID "id103161"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 7681 7715 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432";"Genbank:XM_021528166.1"; ID "id103162"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 16242 16389 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432";"Genbank:XM_021528166.1"; ID "id103163"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 17719 17823 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432";"Genbank:XM_021528166.1"; ID "id103164"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 22272 22916 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432";"Genbank:XM_021528166.1"; ID "id103165"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
I have tried this, and it is insufficient to fix the error when running CellRanger mkgtf; I still get the same error. I believe it is not enough for all the quoted values to be separated by ;. In a case such as Dbxref "GeneID:110469432";"Genbank:XM_021528166.1", I believe this should be transformed either to the form Dbxref "GeneID:110469432"; Genbank "XM_021528166.1" or to something like Dbxref "GeneID:110469432"; Dbxref2 "Genbank:XM_021528166.1".
Moreover, if I change the format to account for this by replacing each instance of '"Genbank:' with 'Genbank "', this error goes away, but a new error appears at line 4624. This line, as output by R, is: "NW_018656673.1\tGnomon\tgene\t21\t551\t.\t-\t.\tgene_id \"gene7633\"; Dbxref \"GeneID:110473781\"; ID \"gene7633\"; Name \"LOC110473781\"; end_range \"551\";\".\"; gbkey \"Gene\"; gene \"LOC110473781\"; gene_biotype \"protein_coding\"; partial \"true\"; start_range \".\";\"21\";"
In this case, the issue is the ;"." that occurs before gbkey. Once I repaired this error, there was another, unrelated error at line 9876. I don't think going through the whole GTF file and making one substitution after another until every issue is repaired is the best approach.
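Rather than fixing one reported line at a time, the file can be scanned once to list every attribute key that carries several quoted values. This is a rough sketch, not Cell Ranger's own parser. It assumes the untouched AGAT output separates multiple values with whitespace, as in the Dbxref excerpt above, and that end_range and start_range had the same form before any substitution. The helper names are hypothetical.

```python
import re
from collections import Counter

# A key followed by one or more double-quoted values separated by whitespace.
ATTRIBUTE = re.compile(r'(\w+)\s+("[^"]*"(?:\s+"[^"]*")*)')
QUOTED = re.compile(r'"[^"]*"')


def multi_valued_keys(attributes):
    """Return the attribute keys that carry more than one quoted value."""
    return [
        match.group(1)
        for match in ATTRIBUTE.finditer(attributes)
        if len(QUOTED.findall(match.group(2))) > 1
    ]


def summarize_multi_valued(lines):
    """Count, per attribute key, how many records have several values."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        attributes = line.rstrip("\n").split("\t")[-1]  # last (9th) GTF column
        counts.update(multi_valued_keys(attributes))
    return counts


# Illustrative input only; in practice pass an open file handle instead.
sample = [
    '##gtf-version 2.5\n',
    'NW_018656673.1\tGnomon\tgene\t21\t551\t.\t-\t.\t'
    'gene_id "gene7633"; end_range "551" "."; start_range "." "21";\n',
]
print(summarize_multi_valued(sample))
```

Running this over the whole genomic.gtf would show up front which keys (Dbxref, end_range, start_range, and possibly others) need attention, instead of discovering them one error at a time.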
You misused the regex replacement.
Dbxref "GeneID:110469432" "Genbank:XM_021528166.1";
must become:
Dbxref "GeneID:110469432", "Genbank:XM_021528166.1";
What you get is wrong:
Dbxref "GeneID:110469432";"Genbank:XM_021528166.1";
Do not forget to select "regex replacement" (the .* button), and to keep the space at the end of both the search pattern and the replacement.
What you suggested also did not work. If I do as you suggested, I get:
##gtf-version 2.5
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build LonStrDom1
#!genome-build-accession NCBI_Assembly:GCF_002197715.1
#!annotation-source NCBI Lonchura striata domestica Annotation Release 100
##sequence-region NW_018654727.1 1 15662897
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=299123
NW_018655405.1 Gnomon gene 6165 22916 . + . gene_id "gene3703"; Dbxref "GeneID:110469432"; ID "gene3703"; Name "LOC110469432"; gbkey "Gene"; gene "LOC110469432"; gene_biotype "protein_coding";
NW_018655405.1 Gnomon transcript 6165 22916 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432","Genbank:XM_021528166.1"; ID "rna7531"; Name "XM_021528166.1"; Parent "gene3703"; gbkey "mRNA"; gene "LOC110469432"; model_evidence "Supporting evidence includes similarity to: 1 EST, 1 Protein, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; original_biotype "mrna"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 6165 6244 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432","Genbank:XM_021528166.1"; ID "id103161"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 7681 7715 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432","Genbank:XM_021528166.1"; ID "id103162"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 16242 16389 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432","Genbank:XM_021528166.1"; ID "id103163"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 17719 17823 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432","Genbank:XM_021528166.1"; ID "id103164"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
NW_018655405.1 Gnomon exon 22272 22916 . + . gene_id "gene3703"; transcript_id "XM_021528166.1"; Dbxref "GeneID:110469432","Genbank:XM_021528166.1"; ID "id103165"; Parent "rna7531"; gbkey "mRNA"; gene "LOC110469432"; product "PRKC apoptosis WT1 regulator protein-like";
I again get the error:
Error parsing GTF at line 10. Parsed attribute had a quote in the middle of a value. Please ensure quotes are only used to encapsulate attribute values. Bad Attribute Value = ID
This error disappeared when I replaced '"Genbank:' with 'Genbank "'.
Still not OK. Be careful with the space at the end of each line.
You should get:
Dbxref "GeneID:110469432", "Genbank:XM_021528166.1";
and not:
Dbxref "GeneID:110469432","Genbank:XM_021528166.1";
The space is important.
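The pattern "[^;,]? without a trailing space and the pattern "[^;,]? followed by a space are different.
Likewise, the replacement ", without a trailing space and the replacement ", followed by a space are different.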
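The effect of the trailing space can be checked on a single attribute string. This is a small sketch using Python's re module rather than an editor. The variable names are illustrative, and the "intended" form is the one requested earlier in this thread (comma followed by a space).

```python
import re

sample = 'Dbxref "GeneID:110469432" "Genbank:XM_021528166.1"; ID "rna7531";'

# Pattern and replacement both end with a space: only the gap between the
# two Dbxref values is rewritten.
intended = re.sub(r'"[^;,]? ', '", ', sample)

# Same pattern, but the replacement lacks its trailing space: the values
# end up glued together as "...","...".
no_space_in_replacement = re.sub(r'"[^;,]? ', '",', sample)

# Pattern without its trailing space: every quote matches, optionally with
# the character after it, so the whole line is mangled.
no_space_in_pattern = re.sub(r'"[^;,]?', '",', sample)

for label, text in [
    ("intended", intended),
    ("no space in replacement", no_space_in_replacement),
    ("no space in pattern", no_space_in_pattern),
]:
    print(f"{label}: {text}")
```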
Oh, I see now. Thank you, I am closing the thread.
Describe the bug
Hello, I am having trouble getting agat_convert_sp_gff2gtf.pl to convert .gff files from NCBI into .gtf files that are compatible with CellRanger. Specifically, I am trying to use it on the RefSeq .gff file found here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002197715.1/
When I try running
agat_convert_sp_gff2gtf.pl -gff genomic.gff -o genomic.gtf --gtf_version 2.5
I get an output GTF file, but I get an error when I try running cellranger mkgtf on it:
cellranger mkgtf genomic.gtf genomic.filtered.gtf --attribute=gene_biotype:protein_coding
What command or preprocessing of the .gff do I need in order to eventually get a .gtf compatible with CellRanger?
General: