Closed Long-zhe closed 3 weeks ago
Hi, I will start with the first problem. I tracked down the problematic line by repeatedly splitting the dataset into two parts and testing whether each part worked. I ended up at line 1335125, which contains a ";" in the value of the 'description' key. Unfortunately, ";" is also the separator between the key/value pairs of column 9...
head -n 1335125 GCF_019924925.1_HZGC01_genomic.gtf | tail -n 1| cut -f 9
gene_id "pygb"; transcript_id ""; db_xref "GeneID:127498152"; description "phosphorylase, glycogen; brain"; gbkey "Gene"; gene "pygb"; gene_biotype "protein_coding";
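Once the cause is known, other lines with the same defect can be searched for directly. The following is a minimal sketch, not part of gtftk: the function name find_semicolon_in_values is hypothetical, and it assumes that column 9 values are wrapped in balanced double quotes with no escaped quotes inside them.

```shell
# Print "LINE_NUMBER: attributes" for every record whose column 9 has a ';'
# inside a double-quoted value. Assumes balanced, unescaped double quotes.
find_semicolon_in_values() {
    awk -F'\t' '
        /^#/ { next }                        # skip comment/header lines
        {
            # Splitting on the quote character puts quoted values in the
            # even-numbered pieces and keys/separators in the odd ones.
            n = split($9, parts, "\"")
            for (i = 2; i <= n; i += 2)
                if (index(parts[i], ";")) { print NR ": " $9; next }
        }' "$1"
}
```

Called as find_semicolon_in_values GCF_019924925.1_HZGC01_genomic.gtf, it prints one line per affected record, prefixed with the line number in the file.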
You can correct this simply by:
sed 's/phosphorylase, glycogen; brain/phosphorylase, glycogen, brain/' GCF_019924925.1_HZGC01_genomic.gtf > GCF_019924925.1_HZGC01_genomic_v2.gtf
gtftk count -i GCF_019924925.1_HZGC01_genomic_v2.gtf
gene 31777
transcript 69595
exon 882914
CDS 783143
start_codon 59458
stop_codon 59497
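The sed command above targets one specific description. If several values turn out to contain a ";", a more general rewrite may be convenient. This is only a sketch under the same assumption of balanced double quotes in column 9; fix_semicolon_in_values is a hypothetical name, and replacing ";" with "," inside values is one possible convention, chosen here to match the manual fix.

```shell
# Replace every ';' found inside a double-quoted value of column 9 with ','.
# Separators between key/value pairs (outside the quotes) are left untouched.
fix_semicolon_in_values() {
    awk 'BEGIN { FS = OFS = "\t" }
        /^#/ || NF < 9 { print; next }       # pass comments and short lines through
        {
            n = split($9, parts, "\"")
            attr = parts[1]
            for (i = 2; i <= n; i++) {
                if (i % 2 == 0) gsub(/;/, ",", parts[i])   # even pieces are quoted values
                attr = attr "\"" parts[i]
            }
            $9 = attr
            print
        }' "$1"
}
```

The corrected file is written to standard output, so it can be redirected to a new file in the same way as the sed output.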
I will try to provide an answer to the second issue. Best
In fact everything seems to be working as soon as you have corrected the GTF file:
gtftk convert_ensembl -i GCF_019924925.1_HZGC01_genomic_v2.gtf -o GCF_019924925.1_HZGC01_genomic_v2_ens.gtf
gtftk count -i GCF_019924925.1_HZGC01_genomic_v2_ens.gtf
gene 31801
transcript 69595
exon 882914
CDS 783143
start_codon 59458
stop_codon 59497
The fact that the number of genes (31801) is now higher than before (31777) indicates that some transcripts in the original file had no associated "gene" line (that is, a line flagged as 'gene' in the 3rd column).
awk 'BEGIN{FS=OFS="\t"}$3=="gene"' GCF_019924925.1_HZGC01_genomic_v2.gtf > GCF_019924925.1_HZGC01_genomic_v2_gene.gtf
awk 'BEGIN{FS=OFS="\t"}$3=="gene"' GCF_019924925.1_HZGC01_genomic_v2_ens.gtf > GCF_019924925.1_HZGC01_genomic_v2_ens_gene.gtf
diff -W 250 -y GCF_019924925.1_HZGC01_genomic_v2_gene.gtf GCF_019924925.1_HZGC01_genomic_v2_ens_gene.gtf | grep ">"| nl
There are exactly 24 of them (31801 - 31777).
For instance, the transcript from "unassigned_gene_12" has no corresponding gene line in the original file.
grep "unassigned_gene_12" GCF_019924925.1_HZGC01_genomic_v2.gtf
NC_010288.1 RefSeq transcript 5351 5418 . - . gene_id "unassigned_gene_12"; transcript_id "unassigned_transcript_1764"; gbkey "tRNA"; product "tRNA-Cys"; transcript_biotype "tRNA";
NC_010288.1 RefSeq exon 5351 5418 . - . gene_id "unassigned_gene_12"; transcript_id "unassigned_transcript_1764"; product "tRNA-Cys"; transcript_biotype "tRNA"; exon_number "1";
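The gene_ids that lack a gene line can also be listed directly, without comparing two files with diff. A rough sketch, assuming every feature line carries a gene_id "..." attribute in column 9; genes_without_gene_line is a hypothetical helper, not a gtftk command.

```shell
# List gene_ids that appear on non-gene features (transcript, exon, ...)
# but never on a line whose 3rd column is "gene".
genes_without_gene_line() {
    awk -F'\t' '
        /^#/ { next }
        match($9, /gene_id "[^"]*"/) {
            # Strip the leading gene_id " (9 characters) and the closing quote.
            id = substr($9, RSTART + 9, RLENGTH - 10)
            if ($3 == "gene") has_gene[id] = 1
            else seen[id] = 1
        }
        END { for (id in seen) if (!(id in has_gene)) print id }' "$1" | sort
}
```

Piping the result of genes_without_gene_line GCF_019924925.1_HZGC01_genomic_v2.gtf through wc -l gives a count that can be compared with the difference reported above.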
We can naturally correct this, and this is one of the jobs of gtftk convert_ensembl: if we know the TSS and TTS of at least one transcript, we can use them as gene coordinates.
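To illustrate the idea only (this is not how gtftk convert_ensembl is implemented), gene coordinates can be approximated by taking the smallest start and the largest end over all transcripts sharing a gene_id. The function name gene_spans_from_transcripts and the output layout (gene_id, chromosome, start, end, strand) are assumptions made for this sketch.

```shell
# For each gene_id, print: gene_id, chromosome, min transcript start,
# max transcript end, strand (tab-separated, 1-based GTF coordinates).
gene_spans_from_transcripts() {
    awk 'BEGIN { FS = OFS = "\t" }
        $3 == "transcript" && match($9, /gene_id "[^"]*"/) {
            id = substr($9, RSTART + 9, RLENGTH - 10)
            s = $4 + 0; e = $5 + 0            # force numeric comparison
            if (!(id in gstart) || s < gstart[id]) gstart[id] = s
            if (!(id in gend)   || e > gend[id])   gend[id] = e
            chrom[id] = $1; strand[id] = $7
        }
        END { for (id in gstart) print id, chrom[id], gstart[id], gend[id], strand[id] }' "$1" | sort
}
```

For a gene with a single transcript, such as "unassigned_gene_12" above, the reported span is simply that transcript's own start and end.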
Dear developer:
First of all, thank you very much for patiently and carefully solving my problem. Following your method, I successfully solved it. Sincerely, thank you! Secondly, regarding how you discovered the issue in the GTF ("I managed to get to the problematic line by cutting recursively the dataset into two parts and testing if it was working. I end up with line 1335125 which contains a ";" in the value of the 'description' key."):
I would really like to know how you found the problem at line 1335125 so quickly.
From your description, my understanding is that you split the GTF file into two parts and tested each part separately with gtftk. One of the parts should fail to run, so you split that part in two again and repeat until you find the line where the problem lies? I don't know if my understanding is correct.
Could you show the code you used during this process? That way, the next time I encounter the same problem I can solve it myself. Thank you very much! If there is any offense, please forgive me, and please forgive me for making such a basic request as a novice!
Hi, yes, this is exactly what I did. I did not write a script; I just did it directly in the terminal. You are welcome, and thanks for your interest in gtftk. Best
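For readers who want to automate the halving procedure, it can be sketched as a small shell function. This is not the maintainer's code (none was written). The name first_bad_line is hypothetical. The sketch assumes that the test command exits with a non-zero status on bad input (a segmentation fault does), that a single line causes the failure, and that any prefix of the file containing that line fails while shorter prefixes pass. The temporary file is given a .gtf extension in case the tool checks it.

```shell
# Usage: first_bad_line FILE CMD [ARGS...]
# Runs "CMD ARGS... PREFIX_FILE" on growing/shrinking prefixes of FILE and
# prints the number of the first line whose inclusion makes CMD fail.
first_bad_line() {
    file=$1; shift
    dir=$(mktemp -d)
    tmp="$dir/prefix.gtf"
    lo=1
    hi=$(( $(wc -l < "$file") ))
    # Invariant: the first lo-1 lines pass; the first hi lines fail (assumed).
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        head -n "$mid" "$file" > "$tmp"
        if "$@" "$tmp" > /dev/null 2>&1; then
            lo=$(( mid + 1 ))    # prefix is fine: the bad line is further down
        else
            hi=$mid              # prefix fails: the bad line is at or before mid
        fi
    done
    rm -rf "$dir"
    echo "$lo"
}
```

With the file from this thread, the call would be first_bad_line GCF_019924925.1_HZGC01_genomic.gtf gtftk count -i, which needs roughly 21 runs for about 1.3 million lines or more. The reported line can then be inspected with head -n N FILE | tail -n 1 | cut -f 9, as shown earlier in the thread.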
Dear developer, I'm sorry to disturb you! When I use the gtftk count function, there are some errors. This is the GTF file link I am using (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/924/925/GCF_019924925.1_HZGC01/GCF_019924925.1_HZGC01_genomic.gtf.gz):
gzip -d GCF_019924925.1_HZGC01_genomic.gtf.gz
gtftk count -i GCF_019924925.1_HZGC01_genomic.gtf
/public/softwares/miniconda3/bin/gtftk:53: DeprecationWarning: Use shutil.which instead of find_executable
  if not find_executable("bedtools"):
Segmentation fault (core dumped)
In addition, I also downloaded the GFF file (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/924/925/GCF_019924925.1_HZGC01/GCF_019924925.1_HZGC01_genomic.gff.gz) for experimentation. First, I used the gffread software to convert the GFF to GTF, and then tried to convert it to an Ensembl-format GTF file using gtftk convert_ensembl, but encountered additional issues:
gzip -d GCF_019924925.1_HZGC01_genomic.gff.gz
gffread -T -o GCF_019924925.1_HZGC01_genomic_gffread.gtf GCF_019924925.1_HZGC01_genomic.gff
gtftk convert_ensembl -i GCF_019924925.1_HZGC01_genomic_gffread.gtf > ensembl.gtf
/public/softwares/miniconda3/bin/gtftk:53: DeprecationWarning: Use shutil.which instead of find_executable
  if not find_executable("bedtools"):
 |-- 22:58-ERROR-convert_ensembl : the gene_id for ? is associated to multiple chromosomes(NC_067221.1, NC_067220.1). Use non ambiguous gene ids (e.g ensembl ids) please.
As a beginner, I have searched for a lot of information but have not been able to solve the problem. I look forward to your reply and apologize for any inconvenience caused!