dputhier / pygtftk

A python package and a set of shell commands to handle GTF files
GNU General Public License v3.0
45 stars 6 forks source link

Segmentation fault core dumped error when reading GTF in python script #179

Closed almart7 closed 1 year ago

almart7 commented 1 year ago

Dear developers, I am facing a weird issue when reading RefSeq GTFs from specific species. When working with human data pygtftk works fine, but when trying to parse mouse and zebrafish GTFs it causes a segmentation fault core dumped error. This happens in ouf cluster but also in my desktop computer.

In my script, I've checked that the step that fails is the following: gtf_content = GTF(gtf_file)

Here are the ftp links to the data I am using:

dputhier commented 1 year ago

Dear Alessandra,

I had a few minutes to look at the problem with the GCF_000002035.6_GRCz11_genomic.gtf file (not yet for the GCF_000001635.26_GRCm38.p6_genomic.gtf.gz file). In GTF files, the ";" character is used to separate key-value pairs enclosed in the last column. Our software breaks the character string on the ";" in order to isolate the different key-value pairs. In the file you provided, one of the transcripts has the product key associated with the value "phosphorylase, glycogen; brain". Since this value contains a ";", it disrupts the reading of the file...

     grep "phosphorylase, glycogen; brain" GCF_000002035.6_GRCz11_genomic.gtf

This requires us to rework the code and make significant changes to the C code (libgtftk). This problem is uncommon but we will try to solve it in the future.

In the meantime, for you, the simplest solution is to change the incriminated separator:

     curl https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Danio_rerio/reference/GCF_000002035.6_GRCz11/GCF_000002035.6_GRCz11_genomic.gtf.gz | gunzip -c | perl -npe 's/phosphorylase, glycogen; brain/phosphorylase, glycogen: brain/' | gtftk convert_ensembl > GCF_000002035.6_GRCz11_genomic_fixed.gtf

I cant provide you with the corresponding code in Python (feel free to ask).

Also, I would like to point out that, from my previous experiences, it is not a good idea to use refSeq annotations when performing some genomic operations. The reason is that refSeq are sequence identifiers not genomic location identifiers. If you have several instances of the same sequence on the same chromosome, they will have the same ID (to my knowledge) and there will be no way to get the 5' and 3' ends of each instance, their sizes... If possible, I would favor ensembl (https://www.ensembl.org/info/data/ftp/index.html).

Best

PS: If needed I can have a look at the other file.

almart7 commented 1 year ago

Thank you for the information!