Closed almart7 closed 1 year ago
Dear Alessandra,
I had a few minutes to look at the problem with the GCF_000002035.6_GRCz11_genomic.gtf file (not yet for the GCF_000001635.26_GRCm38.p6_genomic.gtf.gz file). In GTF files, the ";" character is used to separate key-value pairs enclosed in the last column. Our software breaks the character string on the ";" in order to isolate the different key-value pairs. In the file you provided, one of the transcripts has the product key associated with the value "phosphorylase, glycogen; brain". Since this value contains a ";", it disrupts the reading of the file...
grep "phosphorylase, glycogen; brain" GCF_000002035.6_GRCz11_genomic.gtf
This requires us to rework the code and make significant changes to the C code (libgtftk). This problem is uncommon but we will try to solve it in the future.
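To illustrate the failure mode described above, here is a minimal Python sketch (not the actual libgtftk code, which is written in C). The attribute string is a hypothetical example modeled on the problematic record, and `naive_split` and `quote_aware_split` are illustrative names only. The sketch assumes every value is double-quoted.

```python
import re

# Hypothetical attribute column; only the product value comes from the real file.
attributes = 'gene_id "geneA"; transcript_id "txA"; product "phosphorylase, glycogen; brain";'


def naive_split(attr):
    """Split on every ';', which is what breaks on the embedded separator."""
    return [field.strip() for field in attr.split(";") if field.strip()]


# A key followed by a double-quoted value; ';' inside the quotes is kept.
PAIR = re.compile(r'(\S+)\s+"([^"]*)"')


def quote_aware_split(attr):
    """Return (key, value) tuples, treating quoted values as atomic."""
    return PAIR.findall(attr)


print(naive_split(attributes))        # the product value is cut in two
print(quote_aware_split(attributes))  # the product value stays whole
```

With the naive split, the product value ends up as two fragments (`product "phosphorylase, glycogen` and `brain"`), neither of which is a valid key-value pair.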
In the meantime, the simplest solution for you is to replace the offending separator:
curl https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Danio_rerio/reference/GCF_000002035.6_GRCz11/GCF_000002035.6_GRCz11_genomic.gtf.gz | gunzip -c | perl -npe 's/phosphorylase, glycogen; brain/phosphorylase, glycogen: brain/' | gtftk convert_ensembl > GCF_000002035.6_GRCz11_genomic_fixed.gtf
I can provide you with the corresponding code in Python (feel free to ask).
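As a rough Python counterpart to the perl substitution, the sketch below rewrites a gzipped GTF line by line. Note that it is more general than the one-liner: it replaces every ";" found inside a double-quoted value with ":", not only the one in "phosphorylase, glycogen; brain". The function names `fix_line` and `fix_gtf` are hypothetical, and the code assumes quotes are balanced on each line. It does not run the `gtftk convert_ensembl` step from the command above.

```python
import gzip
import re

# Any double-quoted value (assumes quotes are balanced on each line).
QUOTED = re.compile(r'"[^"]*"')


def fix_line(line):
    """Replace ';' with ':' inside double-quoted values on one GTF line."""
    return QUOTED.sub(lambda match: match.group(0).replace(";", ":"), line)


def fix_gtf(in_path, out_path):
    """Stream a gzipped GTF and write an uncompressed, patched copy."""
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        for line in src:
            # Header/comment lines are copied unchanged.
            dst.write(line if line.startswith("#") else fix_line(line))


# Example call, with the file names used in this thread:
# fix_gtf("GCF_000002035.6_GRCz11_genomic.gtf.gz",
#         "GCF_000002035.6_GRCz11_genomic_patched.gtf")
```

The patched file would still need to go through `gtftk convert_ensembl`, as in the shell command.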
Also, I would like to point out that, in my experience, it is not a good idea to use RefSeq annotations for some genomic operations. The reason is that RefSeq IDs are sequence identifiers, not genomic location identifiers. If several instances of the same sequence occur on the same chromosome, they will share the same ID (to my knowledge), and there will be no way to get the 5' and 3' ends of each instance, their sizes... If possible, I would favor Ensembl (https://www.ensembl.org/info/data/ftp/index.html).
Best
PS: If needed I can have a look at the other file.
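To check whether another file shows the same pattern, one option is to scan the attribute column for quoted values containing ";". This is only a sketch: `find_suspect_lines` is a hypothetical helper, it assumes a tab-separated GTF with the attributes in the last column, and a hit does not prove that this is what causes the crash for that file.

```python
import gzip
import re

# Any double-quoted value (assumes quotes are balanced on each line).
QUOTED = re.compile(r'"[^"]*"')


def has_semicolon_in_value(attributes):
    """True if any double-quoted value contains a ';'."""
    return any(";" in match.group(0) for match in QUOTED.finditer(attributes))


def find_suspect_lines(path, limit=10):
    """Return up to `limit` (line_number, attributes) pairs worth inspecting."""
    opener = gzip.open if path.endswith(".gz") else open
    hits = []
    with opener(path, "rt") as handle:
        for number, line in enumerate(handle, start=1):
            if line.startswith("#"):
                continue  # skip header/comment lines
            attributes = line.rstrip("\n").split("\t")[-1]
            if has_semicolon_in_value(attributes):
                hits.append((number, attributes))
                if len(hits) >= limit:
                    break
    return hits


# Example call:
# for number, attributes in find_suspect_lines("GCF_000001635.26_GRCm38.p6_genomic.gtf.gz"):
#     print(number, attributes)
```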
Thank you for the information!
Dear developers, I am facing a weird issue when reading RefSeq GTFs from specific species. When working with human data, pygtftk works fine, but when trying to parse mouse and zebrafish GTFs it crashes with a "segmentation fault (core dumped)" error. This happens on our cluster and also on my desktop computer.
In my script, I've checked that the step that fails is the following: gtf_content = GTF(gtf_file)
Here are the ftp links to the data I am using:
Homo sapiens (it works fine) https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz
Mus musculus (segmentation fault with both the current and the previous version) https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Mus_musculus/reference/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_genomic.gtf.gz https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Mus_musculus/all_assembly_versions/GCF_000001635.26_GRCm38.p6/GCF_000001635.26_GRCm38.p6_genomic.gtf.gz
Danio rerio (segmentation fault) https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Danio_rerio/reference/GCF_000002035.6_GRCz11/GCF_000002035.6_GRCz11_genomic.gtf.gz