Segmentation fault (core dumped)

Long-zhe commented 4 weeks ago

Dear developer, I'm sorry to disturb you！ When I use the gtftk count function, there are some errors： This is the GTF file link I am using（https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/924/925/GCF_019924925.1_HZGC01/GCF_019924925.1_HZGC01_genomic.gtf.gz）

gzip -d GCF_019924925.1_HZGC01_genomic.gtf.gz gtftk count -i GCF_019924925.1_HZGC01_genomic.gtf /public/softwares/miniconda3/bin/gtftk:53: DeprecationWarning: Use shutil.which instead of find_executable if not find_executable("bedtools"): Segmentation fault (core dumped)

In addition, I also downloaded the gff file （https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/924/925/GCF_019924925.1_HZGC01/GCF_019924925.1_HZGC01_genomic.gff.gz）for experimentation. Firstly, I used the gffread software to convert gff to gtf，and then tried to convert it to an ensemble format gtf file using gtftk convert_ensembl, but encountered additional issues：

gzip -d GCF_019924925.1_HZGC01_genomic.gff.gz gffread -T -o GCF_019924925.1_HZGC01_genomic_gffread.gtf GCF_019924925.1_HZGC01_genomic.gff gtftk convert_ensembl -i GCF_019924925.1_HZGC01_genomic_gffread.gtf > ensembl.gtf /public/softwares/miniconda3/bin/gtftk:53: DeprecationWarning: Use shutil.which instead of find_executable if not find_executable("bedtools"): |-- 22:58-ERROR-convert_ensembl : the gene_id for ? is associated to multiple chromosomes(NC_067221.1, NC_067220.1). Use non ambiguous gene ids (e.g ensembl ids) please.

As a beginner, I have searched for a lot of information but have not been able to successfully solve the problem. I look forward to your reply and apologize for any inconvenience caused！

dputhier commented 4 weeks ago

Hi, I will first answer about the first problem. I managed to get to the problematic line by cutting recursively the dataset into two parts and testing if it was working. I end up with line 1335125 which contains a ";" in the value of the 'description' key. This is also, unfortunatly the field separator of column 9 keys...

     head -n 1335125 GCF_019924925.1_HZGC01_genomic.gtf | tail -n 1| cut -f 9
     gene_id "pygb"; transcript_id ""; db_xref "GeneID:127498152"; description "phosphorylase, glycogen; brain"; gbkey "Gene"; gene "pygb"; gene_biotype "protein_coding";

You can correct this simply by:

        sed 's/phosphorylase, glycogen; brain/phosphorylase, glycogen, brain/' GCF_019924925.1_HZGC01_genomic.gtf > GCF_019924925.1_HZGC01_genomic_v2.gtf
        gtftk count -i GCF_019924925.1_HZGC01_genomic_v2.gtf
    gene    31777
    transcript  69595
    exon    882914
    CDS 783143
    start_codon 59458
    stop_codon  59497

I will try to provide an answer to the second issue. Best

dputhier commented 4 weeks ago

In fact everything seems to be working as soon as you have corrected the GTF file:

    gtftk convert_ensembl -i GCF_019924925.1_HZGC01_genomic_v2.gtf -o GCF_019924925.1_HZGC01_genomic_v2_ens.gtf
    gtftk count -i GCF_019924925.1_HZGC01_genomic_v2_ens.gtf
    gene    31801
    transcript  69595
    exon    882914
    CDS 783143
    start_codon 59458
    stop_codon  59497

The fact that the number of genes (31801) is now higher (compared to 31777) indicates that there were some transcripts without any associated "gene" lines in the original file (that is lines whose elements are flag as 'gene' in the 3rd column).

        awk 'BEGIN{FS=OFS="\t"}$3=="gene"' GCF_019924925.1_HZGC01_genomic_v2.gtf > GCF_019924925.1_HZGC01_genomic_v2_gene.gtf
        awk 'BEGIN{FS=OFS="\t"}$3=="gene"' GCF_019924925.1_HZGC01_genomic_v2_ens.gtf > GCF_019924925.1_HZGC01_genomic_v2_ens_gene.gtf
        diff   -W 250 -y  GCF_019924925.1_HZGC01_genomic_v2_gene.gtf GCF_019924925.1_HZGC01_genomic_v2_ens_gene.gtf  | grep ">"| nl

There are exactly 24 (31801 - 31777).

For instance, transcript from "unassigned_gene_12" have no corresponding gene line in original file.

    grep "unassigned_gene_12" GCF_019924925.1_HZGC01_genomic_v2.gtf
    NC_010288.1 RefSeq  transcript  5351    5418    .   -   .   gene_id "unassigned_gene_12"; transcript_id "unassigned_transcript_1764"; gbkey "tRNA"; product "tRNA-Cys"; transcript_biotype "tRNA";
    NC_010288.1 RefSeq  exon    5351    5418    .   -   .   gene_id "unassigned_gene_12"; transcript_id "unassigned_transcript_1764"; product "tRNA-Cys"; transcript_biotype "tRNA"; exon_number "1";

We can naturally correct this since (and this is one of the job of gtftk convert_ensembl). If we know the tss and tts of at least one transcript, then we can use them as gene coordinates.

Long-zhe commented 3 weeks ago

Dear developer：

First of all, thank you very much for patiently and carefully solving my problem. According to your operating method, I successfully solved the problem，Sincerely thank you！ Secondly, regarding how you discovered the issues in the GTF（I managed to get to the problematic line by cutting recursively the dataset into two parts and testing if it was working. I end up with line 1335125 which contains a ";" in the value of the 'description' key.）

I really want to know how you quickly found the problem at line 1335125.

Based on your description of solving the problem, my simple understanding is that you divided the GTF file into two parts and tested each part separately using gtftk. However, one part should not be able to run properly. And you divide it into two files again and repeat until find the line where the problem lies? I don't know if my understanding is correct.

May I ask if you can demonstrate the code used during your processing? So next time I encounter the same problem, I can solve it myself, Thank you very much.! If there is any offense, please forgive me! Please forgive me for making such a low-level request as a novice！

dputhier commented 3 weeks ago

Hi, Yes this is exactly what I did. I did not produce any script but just did it directly in the terminal. You are welcome and thanks for your interest in gtftk. Best

Le ven. 11 oct. 2024 à 17:31, Long Zhe @.***> a écrit :

Dear developer：

First of all, thank you very much for patiently and carefully solving my problem. According to your operating method, I successfully solved the problem，Sincerely thank you！ Secondly, regarding how you discovered the issues in the GTF（I managed to get to the problematic line by cutting recursively the dataset into two parts and testing if it was working. I end up with line 1335125 which contains a ";" in the value of the 'description' key.）

I really want to know how you quickly found the problem at line 1335125.

Based on your description of solving the problem, my simple understanding is that you divided the GTF file into two parts and tested each part separately using gtftk. However, one part should not be able to run properly. And you divide it into two files again and repeat until find the line where the problem lies? I don't know if my understanding is correct.

May I ask if you can demonstrate the code used during your processing? So next time I encounter the same problem, I can solve it myself, Thank you very much.! If there is any offense, please forgive me! Please forgive me for making such a low-level request as a novice！

— Reply to this email directly, view it on GitHub https://github.com/dputhier/pygtftk/issues/187#issuecomment-2407656503, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN7CHU4XZV7UJ5YQGQUTX3Z27VMBAVCNFSM6AAAAABPW745SKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMBXGY2TMNJQGM . You are receiving this because you commented.Message ID: @.***>

dputhier / pygtftk

Segmentation fault (core dumped) #187