OliveiraDS-hub / ChimeraTE

A pipeline to detect chimeric transcripts derived from genes and transposable elements.
GNU General Public License v3.0

ValueError arose after Rep1 and 2 analysis #12

Closed KaiOtsuka closed 10 months ago

KaiOtsuka commented 11 months ago

Hi Daniel,

I encountered the following error while running the pipeline (I have also attached a log file). It appears immediately after the analysis of the replicates; the ## TE-exonized analysis ## step had already finished for both Rep1 and Rep2. Could you please help me solve this issue?

Traceback (most recent call last):
  File "path/to/nanops.py", line 1427, in _ensure_numeric
    x = float(x)
ValueError: could not convert string to float: 'ID_B1_dup18237;L1ME3A_dup375;'

During handling of the above exception, another exception occurred:

Is this due to the format of the GTF file I used? Just in case, here are the first 5 rows of my reference GTF file (output of cat -t myfile.gtf):

chr1^Imm10_rmsk^Iexon^I3000001^I3002128^I12955^I-^I.^Igene_id "L1_Mus3"; transcript_id "L1_Mus3_dup24"; family_id "L1"; class_id "LINE";
chr1^Imm10_rmsk^Iexon^I3003153^I3003994^I1216^I-^I.^Igene_id "L1Md_F"; transcript_id "L1Md_F_dup2"; family_id "L1"; class_id "LINE";
chr1^Imm10_rmsk^Iexon^I3003994^I3004054^I234^I-^I.^Igene_id "L1_Mus3"; transcript_id "L1_Mus3_dup25"; family_id "L1"; class_id "LINE";
chr1^Imm10_rmsk^Iexon^I3004041^I3004206^I3685^I+^I.^Igene_id "L1_Rod"; transcript_id "L1_Rod_dup1"; family_id "L1"; class_id "LINE";
chr1^Imm10_rmsk^Iexon^I3004271^I3005001^I3685^I+^I.^Igene_id "L1_Rod"; transcript_id "L1_Rod_dup2"; family_id "L1"; class_id "LINE";

I would be very grateful for your help with this problem. Thanks

log.txt

OliveiraDS-hub commented 11 months ago

Dear @KaiOtsuka,

It will be a pleasure to assist you with this issue.

According to your log file, the identification of TE-initiated, TE-exonized and TE-terminated transcripts was completed successfully in both replicates. Can you confirm that for me? Just check whether you have .ct tables in projects/$your_project_name/tmp/

If the files are there as expected, could you send the TE-initiated line for the gene "ID_B1_dup18237;L1ME3A_dup375;" for Rep1 and Rep2, if it exists in both replicates?

In addition, your 5 rows contain only exon regions. Could you send me 5 rows of genes? You can get them with awk '$3 == "gene"' file.gtf | head -5

Looks like there is something simple going on here. We will solve it.
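The check for gene rows suggested above can also be done by counting every feature type in the third column. The following is only a minimal sketch, not part of ChimeraTE: the function name count_features is hypothetical, and it assumes a tab-delimited GTF whose comment lines start with "#".

```python
from collections import Counter


def count_features(lines):
    """Count the feature types found in column 3 of tab-delimited GTF lines.

    Hypothetical helper for illustration; not a ChimeraTE function.
    """
    counts = Counter()
    for line in lines:
        # Skip comment/header lines and empty lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    # "file.gtf" is the placeholder file name used in the awk command above.
    with open("file.gtf") as handle:
        for feature, n in count_features(handle).most_common():
            print(feature, n, sep="\t")
```

If "gene" does not appear in the printed list, the awk command above will return nothing for that file.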

Drosofriends commented 10 months ago

Hi Daniel, I'm using ChimeraTE on Drosophila samples and I got the same error as KaiOtsuka. Can you help me solve the problem? I am attaching the log file. I checked the tmp folder and I don't have the .ct tables. Below is the head of my GTF file, which has only exon features, not gene, in the third column. Thanks for your help.

log.txt

2L dm6_rmsk exon 1000001 1005388 48233 - . gene_id "ROO_I-int"; transcript_id "ROO_I-int_dup219"; family_id "Pao"; class_id "LTR";
2L dm6_rmsk exon 10033481 10033519 226 + . gene_id "DNAREP1_DM"; transcript_id "DNAREP1_DM_dup1504"; family_id "Helitron"; class_id "RC";
2L dm6_rmsk exon 10033628 10033673 270 + . gene_id "DNAREP1_DM"; transcript_id "DNAREP1_DM_dup1505"; family_id "Helitron"; class_id "RC";
2L dm6_rmsk exon 10050978 10051031 243 + . gene_id "DNAREP1_DM"; transcript_id "DNAREP1_DM_dup1506"; family_id "Helitron"; class_id "RC";
2L dm6_rmsk exon 1005389 1005816 3968 - . gene_id "ROO_LTR"; transcript_id "ROO_LTR_dup64"; family_id "Pao"; class_id "LTR";
2L dm6_rmsk exon 10118445 10118565 739 - . gene_id "HOBO"; transcript_id "HOBO_dup25"; family_id "hAT-hobo"; class_id "DNA";
2L dm6_rmsk exon 10138219 10143401 47824 - . gene_id "DMRT1B"; transcript_id "DMRT1B_dup69"; family_id "R1"; class_id "LINE";
2L dm6_rmsk exon 10156377 10156777 3730 - . gene_id "BLOOD_LTR"; transcript_id "BLOOD_LTR_dup16"; family_id "Gypsy"; class_id "LTR";
2L dm6_rmsk exon 10156778 10163388 60250 - . gene_id "BLOOD_I-int"; transcript_id "BLOOD_I-int_dup8"; family_id "Gypsy"; class_id "LTR";
2L dm6_rmsk exon 10163389 10163789 3730 - . gene_id "BLOOD_LTR"; transcript_id "BLOOD_LTR_dup17"; family_id "Gypsy"; class_id "LTR";

OliveiraDS-hub commented 10 months ago

Hi @Drosofriends (nice nickname)

Please, use this command line to "clean" your TE gtf file:

sed 's/gene_id//g; s/";.*//g; s/"//g; s/ /\t/g; s/\t\t/\t/g' TE.gtf > TE_clean.gtf

You can already try to run ChimeraTE with that.
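For readers who prefer not to rely on sed (the \t escape in the command above is a GNU sed extension), the same five substitutions can be sketched in Python. This is only an illustration that mirrors the sed expressions in order; the function names clean_te_gtf_line and clean_te_gtf are hypothetical and are not part of ChimeraTE.

```python
import re


def clean_te_gtf_line(line):
    """Apply the five sed substitutions, in the same order, to one GTF line."""
    line = line.rstrip("\n")
    line = line.replace("gene_id", "")   # s/gene_id//g
    line = re.sub(r'";.*', "", line)     # s/";.*//g   (drop everything after the first ID)
    line = line.replace('"', "")         # s/"//g
    line = line.replace(" ", "\t")       # s/ /\t/g
    line = line.replace("\t\t", "\t")    # s/\t\t/\t/g
    return line


def clean_te_gtf(src_path, dst_path):
    """Write a cleaned copy of src_path to dst_path, one line at a time."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(clean_te_gtf_line(line) + "\n")


if __name__ == "__main__":
    # Same file names as in the sed command above.
    clean_te_gtf("TE.gtf", "TE_clean.gtf")
```

On a tab-delimited input row, the result keeps the first eight columns and leaves only the bare TE name (the former gene_id value) in the ninth.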

Be sure that your gene.gtf file is in the right format. You can compare it with the one in the example_data/mode1 folder.

Drosofriends commented 10 months ago

Thank you so much! My TE.gtf file is now in the right format, but I noticed that my gene.gtf is in fact slightly different from the one provided as an example. As you did for TE_clean.gtf, could you please send me a command to convert it to the right format? This would help me avoid any other problems caused by a malformed GTF file. Here are the first lines of my GTF:

#!genome-build BDGP6.32
#!genome-version BDGP6.32
#!genome-build-accession GCA_000001215.4
#!genebuild-last-updated 2020-08

3R FlyBase gene 7145880 7150968 . - . gene_id "FBgn0250732"; gene_name "gfzf"; gene_source "FlyBase"; gene_biotype "protein_coding";
3R FlyBase transcript 7145880 7150968 . - . gene_id "FBgn0250732"; transcript_id "FBtr0091512"; gene_name "gfzf"; gene_source "FlyBase"; gene_biotype "protein_coding"; transcript_name "gfzf-RB"; transcript_source "FlyBase"; transcript_biotype "protein_coding";
3R FlyBase exon 7150166 7150968 . - . gene_id "FBgn0250732"; transcript_id "FBtr0091512"; exon_number "1"; gene_name "gfzf"; gene_source "FlyBase"; gene_biotype "protein_coding"; transcript_name "gfzf-RB"; transcript_source "FlyBase"; transcript_biotype "protein_coding"; exon_id "FBtr0091512-E1";
3R FlyBase CDS 7150166 7150630 . - 0 gene_id "FBgn0250732"; transcript_id "FBtr0091512"; exon_number "1"; gene_name "gfzf"; gene_source "FlyBase"; gene_biotype "protein_coding"; transcript_name "gfzf-RB"; transcript_source "FlyBase"; transcript_biotype "protein_coding"; protein_id "FBpp0290855";
3R FlyBase start_codon 7150628 7150630 . - 0 gene_id "FBgn0250732"; transcript_id "FBtr0091512"; exon_number "1"; gene_name "gfzf"; gene_source "FlyBase"; gene_biotype "protein_coding"; transcript_name "gfzf-RB"; transcript_source "FlyBase"; transcript_biotype "protein_coding";

Best regards

OliveiraDS-hub commented 10 months ago

If your gene.gtf file is tab delimited, with nine columns, it is ready to be used.

The most important information is "gene" and "exon" in the third column, and then your FBgn code in the 9th column. ChimeraTE is written to ignore the "gene_id" before your FBgn code, so don't worry.

The sample you sent me has everything you need.
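The "tab delimited, with nine columns" condition can be verified before running the pipeline. The snippet below is a minimal sketch under stated assumptions: check_gtf_columns is a hypothetical helper, not a ChimeraTE function, and it treats lines starting with "#" as comments.

```python
def check_gtf_columns(lines, expected=9):
    """Return (line_number, n_columns) for each data line that does not
    split into `expected` tab-delimited columns.

    Hypothetical helper for illustration; not a ChimeraTE function.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        # Comment/header lines and empty lines are not data rows.
        if line.startswith("#") or not line.strip():
            continue
        n_columns = len(line.rstrip("\n").split("\t"))
        if n_columns != expected:
            bad.append((number, n_columns))
    return bad


if __name__ == "__main__":
    with open("gene.gtf") as handle:
        problems = check_gtf_columns(handle)
    if problems:
        for number, n_columns in problems[:10]:
            print(f"line {number}: {n_columns} columns")
    else:
        print("all data lines have nine tab-delimited columns")
```

An empty result only says the column layout is as described above; it does not check the content of the ninth column.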

OliveiraDS-hub commented 10 months ago

Closing this issue. Feel free to reopen it.