Closed KaiOtsuka closed 10 months ago
Dear @KaiOtsuka,
It will be a pleasure to assist you with this issue.
According to your log file, the identification of TE-initiated, TE-exonized and TE-terminated transcripts was completed successfully in both replicates. Can you confirm that for me? Just check whether you have .ct tables in projects/$your_project_name/tmp/
If the files are there as expected, could you send the TE-initiated line for the gene "ID_B1_dup18237;L1ME3A_dup375;" from Rep1 and Rep2, if it exists in both replicates?
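A minimal sketch of that check, assuming a POSIX shell with `find`. The temporary directory stands in for projects/$your_project_name, and the file name `example_Rep1.ct` is hypothetical (the real table names are not given in this thread); it exists only so the snippet runs on its own.

```shell
# Stand-in for projects/$your_project_name (assumption: replace with your real project path).
project_dir=$(mktemp -d)
mkdir -p "$project_dir/tmp"
touch "$project_dir/tmp/example_Rep1.ct"   # hypothetical file name, for illustration only

# Count the .ct tables directly inside the tmp folder.
ct_count=$(find "$project_dir/tmp" -maxdepth 1 -name '*.ct' | wc -l | tr -d ' ')
echo "number of .ct tables: $ct_count"

rm -rf "$project_dir"
```

A count of 0 on a real project would mean the .ct tables were never written.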
In addition, your 5 rows contain only exon regions; could you send me 5 rows of gene features? You can get them with:
awk '$3 == "gene"' file.gtf | head -5
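A quick way to see which feature types a GTF file contains at all is to tally its third column. The sketch below is only an illustration: it builds a three-row, tab-delimited sample adapted from rows posted in this thread, and the variable names are made up for the example.

```shell
# Illustrative sample only: three tab-delimited rows adapted from this thread.
gtf=$(mktemp)
printf '3R\tFlyBase\tgene\t7145880\t7150968\t.\t-\t.\tgene_id "FBgn0250732";\n' > "$gtf"
printf '3R\tFlyBase\texon\t7150166\t7150968\t.\t-\t.\tgene_id "FBgn0250732";\n' >> "$gtf"
printf '2L\tdm6_rmsk\texon\t1000001\t1005388\t48233\t-\t.\tgene_id "ROO_I-int";\n' >> "$gtf"

# Count each feature type found in column 3, skipping comment lines.
feature_counts=$(awk -F'\t' '!/^#/ {n[$3]++} END {for (f in n) print f, n[f]}' "$gtf" | sort)
echo "$feature_counts"

rm -f "$gtf"
```

If "gene" does not show up in the tally for your own file, the awk command above will return nothing.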
Looks like there is something simple going on here. We will solve it.
Hi Daniel, I'm using ChimeraTE on Drosophila samples. I had the same error as KaiOtsuka; can you help me solve the problem? I am sending you the log file. I checked the tmp folder and I don't have the .ct tables. Below is the head of my GTF file, in which the third column contains only exon features and no gene features.
log.txt
Thanks for your help
2L dm6_rmsk exon 1000001 1005388 48233 - . gene_id "ROO_I-int"; transcript_id "ROO_I-int_dup219"; family_id "Pao"; class_id "LTR";
2L dm6_rmsk exon 10033481 10033519 226 + . gene_id "DNAREP1_DM"; transcript_id "DNAREP1_DM_dup1504"; family_id "Helitron"; class_id "RC";
2L dm6_rmsk exon 10033628 10033673 270 + . gene_id "DNAREP1_DM"; transcript_id "DNAREP1_DM_dup1505"; family_id "Helitron"; class_id "RC";
2L dm6_rmsk exon 10050978 10051031 243 + . gene_id "DNAREP1_DM"; transcript_id "DNAREP1_DM_dup1506"; family_id "Helitron"; class_id "RC";
2L dm6_rmsk exon 1005389 1005816 3968 - . gene_id "ROO_LTR"; transcript_id "ROO_LTR_dup64"; family_id "Pao"; class_id "LTR";
2L dm6_rmsk exon 10118445 10118565 739 - . gene_id "HOBO"; transcript_id "HOBO_dup25"; family_id "hAT-hobo"; class_id "DNA";
2L dm6_rmsk exon 10138219 10143401 47824 - . gene_id "DMRT1B"; transcript_id "DMRT1B_dup69"; family_id "R1"; class_id "LINE";
2L dm6_rmsk exon 10156377 10156777 3730 - . gene_id "BLOOD_LTR"; transcript_id "BLOOD_LTR_dup16"; family_id "Gypsy"; class_id "LTR";
2L dm6_rmsk exon 10156778 10163388 60250 - . gene_id "BLOOD_I-int"; transcript_id "BLOOD_I-int_dup8"; family_id "Gypsy"; class_id "LTR";
2L dm6_rmsk exon 10163389 10163789 3730 - . gene_id "BLOOD_LTR"; transcript_id "BLOOD_LTR_dup17"; family_id "Gypsy"; class_id "LTR";
Hi @Drosofriends (nice nickname)
Please, use this command line to "clean" your TE gtf file:
sed 's/gene_id//g; s/";.*//g; s/"//g; s/ /\t/g; s/\t\t/\t/g' TE.gtf > TE_clean.gtf
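To see what this command does, it can be run on a single row first. The sketch below assumes GNU sed (the `\t` escapes are a GNU extension and may not work with BSD/macOS sed) and a tab-delimited input; the temporary file and variable names are made up for the example, and the row is the first one posted above.

```shell
# One tab-delimited TE row, copied from the rows posted in this thread.
te_gtf=$(mktemp)
printf '2L\tdm6_rmsk\texon\t1000001\t1005388\t48233\t-\t.\tgene_id "ROO_I-int"; transcript_id "ROO_I-int_dup219"; family_id "Pao"; class_id "LTR";\n' > "$te_gtf"

# Same sed expression as above (assumes GNU sed for the \t escapes).
cleaned=$(sed 's/gene_id//g; s/";.*//g; s/"//g; s/ /\t/g; s/\t\t/\t/g' "$te_gtf")

# Inspect the result: number of tab-separated columns and the content of column 9.
ncols=$(printf '%s\n' "$cleaned" | awk -F'\t' '{print NF}')
col9=$(printf '%s\n' "$cleaned" | awk -F'\t' '{print $9}')
echo "columns: $ncols, column 9: $col9"

rm -f "$te_gtf"
```

The row keeps its nine columns, and the ninth is reduced to the bare TE identifier.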
You can already try to run ChimeraTE with that.
Be sure that your gene.gtf file is in the right format. You can compare with the one in example_data/mode1 folder.
Thank you so much! My TE.gtf file is now in the right format, but I noticed that my gene.gtf is in fact slightly different from the one provided as an example. As you did for TE_clean.gtf, could you please send me a command to convert it to the right format? This will help me avoid any other problems due to a malformed GTF file. Here are the first lines of my GTF:
3R FlyBase gene 7145880 7150968 . - . gene_id "FBgn0250732"; gene_name "gfzf"; gene_source "FlyBase"; gene_biotype "protein_coding";
3R FlyBase transcript 7145880 7150968 . - . gene_id "FBgn0250732"; transcript_id "FBtr0091512"; gene_name "gfzf"; gene_source "FlyBase"; gene_biotype "protein_coding"; transcript_name "gfzf-RB"; transcript_source "FlyBase"; transcript_biotype "protein_coding";
3R FlyBase exon 7150166 7150968 . - . gene_id "FBgn0250732"; transcript_id "FBtr0091512"; exon_number "1"; gene_name "gfzf"; gene_source "FlyBase"; gene_biotype "protein_coding"; transcript_name "gfzf-RB"; transcript_source "FlyBase"; transcript_biotype "protein_coding"; exon_id "FBtr0091512-E1";
3R FlyBase CDS 7150166 7150630 . - 0 gene_id "FBgn0250732"; transcript_id "FBtr0091512"; exon_number "1"; gene_name "gfzf"; gene_source "FlyBase"; gene_biotype "protein_coding"; transcript_name "gfzf-RB"; transcript_source "FlyBase"; transcript_biotype "protein_coding"; protein_id "FBpp0290855";
3R FlyBase start_codon 7150628 7150630 . - 0 gene_id "FBgn0250732"; transcript_id "FBtr0091512"; exon_number "1"; gene_name "gfzf"; gene_source "FlyBase"; gene_biotype "protein_coding"; transcript_name "gfzf-RB"; transcript_source "FlyBase"; transcript_biotype "protein_coding";
Best regards
If your gene.gtf file is tab delimited, with nine columns, it is ready to be used.
The most important information is "gene" and "exon" in the third column, and then your FBgn code in the 9th column. ChimeraTE is written to ignore the "gene_id" before your FBgn code, so don't worry.
In this sample that you sent me you have everything you need.
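One possible way to confirm the "tab delimited, with nine columns" condition is to count the rows that break it. This is only a sketch: the two-row sample is made up from rows in this thread, with the second row deliberately space-delimited so that it fails the check.

```shell
# Sample file: the first row is tab-delimited, the second is space-delimited on purpose.
gene_gtf=$(mktemp)
printf '3R\tFlyBase\tgene\t7145880\t7150968\t.\t-\t.\tgene_id "FBgn0250732"; gene_name "gfzf";\n' > "$gene_gtf"
printf '3R FlyBase exon 7150166 7150968 . - . gene_id "FBgn0250732";\n' >> "$gene_gtf"

# Count non-comment rows that do not have exactly nine tab-separated columns.
bad_lines=$(awk -F'\t' '!/^#/ && NF != 9 {bad++} END {print bad+0}' "$gene_gtf")
echo "rows without nine tab-separated columns: $bad_lines"

rm -f "$gene_gtf"
```

On a well-formed gene.gtf the count should be 0.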
Closing this issue. Feel free to reopen it.
Hi Daniel,
I encountered the following error during the pipeline run (I have also attached a log file here). It appears immediately after the analysis of the replicates (both ## TE-exonized analysis ## steps for Rep1 and Rep2 had already finished). Would you help me solve this issue, please?
Is this due to the format of the GTF file I used? Just in case, I attached the first 5 rows of my reference gtf file, here. (cat -t myfile.gtf)
I'd be very happy if you would reply to me to solve this problem. Thanks
log.txt