dieterich-lab / DCC

DCC uses output from the STAR read mapper to systematically detect back-splice junctions in next-generation sequencing data. DCC applies a series of filters and integrates data across replicate sets to arrive at a precise list of circRNA candidates.
https://dieterichlab.org/software/
GNU General Public License v3.0

ValueError: could not convert string to float: chrM #69

Closed hydraphenix closed 2 years ago

hydraphenix commented 5 years ago

Describe the bug: I ran DCC on 12 mouse samples and hit an error. The error output is shown below:

Using files _tmp_DCC/tmp_circCount and _tmp_DCC/tmp_coordinates for filtering
Traceback (most recent call last):
  File "/workplace/software/python27/bin/DCC", line 11, in load_entry_point('DCC==0.4.4', 'console_scripts', 'DCC')()
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 339, in main
  File "build/bdist.linux-x86_64/egg/DCC/circFilter.py", line 50, in readcirc
ValueError: could not convert string to float: chrM

When I looked into the cause, I found a malformed row in "tmp_circCount". Some of the surrounding rows are copied below:

chrM 91 1952 0 0 0 0 0 1 0 0 0 0 0 0
chrM 91 8184 0 0 0 0 0 0 0 0 1 0 0 0
chrM 91 11375 chrM chrM chrM chrM chrM chrM chrM chrM chrM chrM chrM chrM
chrM 91 14456 0 0 0 0 1 0 0 0 0 0 0 0
chrM 91 14906 0 0 0 0 0 1 0 0 0 0 0 0
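A quick way to locate rows like the third one is to scan the count table for non-numeric values in the count columns. The following is only a minimal sketch and not part of DCC. It assumes a whitespace-separated table with chr, start and end in the first three columns and per-sample counts after them, as in the rows above. The function name find_bad_rows is made up for illustration, and a header line, if the file has one, would be reported as well.

```python
def find_bad_rows(lines, n_coord_cols=3):
    """Return (line_number, line) pairs whose count columns are not numeric.

    Assumption: the first `n_coord_cols` columns are chr, start, end and
    every remaining column is a per-sample count.
    """
    bad = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        for value in fields[n_coord_cols:]:
            try:
                float(value)  # the same conversion that fails in the traceback
            except ValueError:
                bad.append((lineno, line.rstrip("\n")))
                break  # one report per row is enough
    return bad


# Shortened example rows (3 samples instead of 12).
rows = [
    "chrM\t91\t8184\t0\t0\t1\n",
    "chrM\t91\t11375\tchrM\tchrM\tchrM\n",
    "chrM\t91\t14456\t0\t1\t0\n",
]
for lineno, line in find_bad_rows(rows):
    print(lineno, line)
```

To check a real file, an open file handle can be passed in place of the list, for example the _tmp_DCC/tmp_circCount file named in the log output.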

command used: DCC -T 10 @samplesheet.txt -mt1 @mate1.txt -mt2 @mate2.txt -m 100000 -n 150 -D -R /workplace/database/mouse/MM10_RepeatMasker_SimpleRepeat.gtf -an /workplace/database/mouse/Mus_musculus_MM10_forRNAseq3875_20170608.gtf -Pi -F -M -Nr 2 1 -A /workplace/database/mouse/BOWTIE2_MM10_Base.fa

OS: CentOS release 6.7 (Final)
Python: 2.7.3
DCC: 0.4.7

tjakobi commented 4 years ago

Dear @hydraphenix,

sorry for the late reply. That indeed looks weird. Could you please paste the first hundred or so lines of /workplace/database/mouse/MM10_RepeatMasker_SimpleRepeat.gtf and /workplace/database/mouse/Mus_musculus_MM10_forRNAseq3875_20170608.gtf? In addition, column 2 looks weird: column 3 is okay, but the start coordinate is always 91?

Cheers, Tobias

hydraphenix commented 4 years ago

Thanks for your reply!

The first 20 rows of Mus_musculus_MM10_forRNAseq3875_20170608.gtf:

chr1 pseudogene gene 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene";
chr1 unprocessed_pseudogene transcript 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; transcript_id "ENSMUST00000160944"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene"; transcript_name "Gm16088-001"; transcript_source "havana"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 unprocessed_pseudogene exon 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; transcript_id "ENSMUST00000160944"; exon_number "1"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene"; transcript_name "Gm16088-001"; transcript_source "havana"; exon_id "ENSMUSE00000848981"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 snRNA gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA";
chr1 snRNA transcript 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; transcript_id "ENSMUST00000082908"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl";
chr1 snRNA exon 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; transcript_id "ENSMUST00000082908"; exon_number "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl"; exon_id "ENSMUSE00000522066";
chr1 protein_coding gene 3205901 3671498 . - . gene_id "ENSMUSG00000051951"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
chr1 processed_transcript transcript 3205901 3216344 . - . gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000162897"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-003"; transcript_source "havana"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 processed_transcript exon 3213609 3216344 . - . gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000162897"; exon_number "1"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-003"; transcript_source "havana"; exon_id "ENSMUSE00000858910"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 processed_transcript exon 3205901 3207317 . - . gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000162897"; exon_number "2"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-003"; transcript_source "havana"; exon_id "ENSMUSE00000866652"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 processed_transcript transcript 3206523 3215632 . - . gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000159265"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-002"; transcript_source "havana"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 processed_transcript exon 3213439 3215632 . - . gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000159265"; exon_number "1"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-002"; transcript_source "havana"; exon_id "ENSMUSE00000863980"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 processed_transcript exon 3206523 3207317 . - . gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000159265"; exon_number "2"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-002"; transcript_source "havana"; exon_id "ENSMUSE00000867897"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 protein_coding transcript 3214482 3671498 . - . gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000070533"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-001"; transcript_source "ensembl_havana"; tag "CCDS"; ccds_id "CCDS14803"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 protein_coding exon 3670552 3671498 . - . gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000070533"; exon_number "1"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-001"; transcript_source "ensembl_havana"; tag "CCDS"; ccds_id "CCDS14803"; exon_id "ENSMUSE00000485541"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 protein_coding CDS 3670552 3671348 . - 0 gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000070533"; exon_number "1"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-001"; transcript_source "ensembl_havana"; tag "CCDS"; ccds_id "CCDS14803"; protein_id "ENSMUSP00000070648"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 protein_coding start_codon 3671346 3671348 . - 0 gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000070533"; exon_number "1"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-001"; transcript_source "ensembl_havana"; tag "CCDS"; ccds_id "CCDS14803"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 protein_coding exon 3421702 3421901 . - . gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000070533"; exon_number "2"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-001"; transcript_source "ensembl_havana"; tag "CCDS"; ccds_id "CCDS14803"; exon_id "ENSMUSE00000449517"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 protein_coding CDS 3421702 3421901 . - 1 gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000070533"; exon_number "2"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-001"; transcript_source "ensembl_havana"; tag "CCDS"; ccds_id "CCDS14803"; protein_id "ENSMUSP00000070648"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
chr1 protein_coding exon 3214482 3216968 . - . gene_id "ENSMUSG00000051951"; transcript_id "ENSMUST00000070533"; exon_number "3"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "Xkr4-001"; transcript_source "ensembl_havana"; tag "CCDS"; ccds_id "CCDS14803"; exon_id "ENSMUSE00000448840"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";

The first 20 rows of MM10_RepeatMasker_SimpleRepeat.gtf

chr1 mm10_rmsk exon 67108753 67108881 239.000000 + . gene_id "RLTR17B_Mm"; transcript_id "RLTR17B_Mm";
chr1 mm10_rmsk exon 134217652 134217732 230.000000 - . gene_id "BC1_Mm"; transcript_id "BC1_Mm";
chr1 mm10_rmsk exon 8386826 8389555 8310.000000 - . gene_id "Lx2"; transcript_id "Lx2";
chr1 mm10_rmsk exon 16776989 16779051 32159.000000 + . gene_id "L1_Mus1"; transcript_id "L1_Mus1";
chr1 mm10_rmsk exon 33554409 33554640 216.000000 - . gene_id "B4"; transcript_id "B4";
chr1 mm10_rmsk exon 41943030 41943081 200.000000 + . gene_id "(CA)n"; transcript_id "(CA)n";
chr1 mm10_rmsk exon 50329972 50335398 28308.000000 + . gene_id "L1Md_T"; transcript_id "L1Md_T";
chr1 mm10_rmsk exon 83885791 83886358 4868.000000 - . gene_id "L1Md_T"; transcript_id "L1Md_T_dup1";
chr1 mm10_rmsk exon 92274683 92274767 723.000000 + . gene_id "(TAGG)n"; transcript_id "(TAGG)n";
chr1 mm10_rmsk exon 109051333 109052326 8314.000000 + . gene_id "L1Md_T"; transcript_id "L1Md_T_dup2";
chr1 mm10_rmsk exon 125828928 125829476 3167.000000 + . gene_id "Lx5"; transcript_id "Lx5";
chr1 mm10_rmsk exon 167772061 167772244 493.000000 - . gene_id "L1M2"; transcript_id "L1M2";
chr1 mm10_rmsk exon 184549327 184549452 584.000000 + . gene_id "B3A"; transcript_id "B3A";
chr1 mm10_rmsk exon 3145674 3145796 314.000000 - . gene_id "RMER16A3"; transcript_id "RMER16A3";
chr1 mm10_rmsk exon 5242238 5242959 3620.000000 - . gene_id "RMER13B"; transcript_id "RMER13B";
chr1 mm10_rmsk exon 7339881 7340133 1530.000000 - . gene_id "MYSERV6-int"; transcript_id "MYSERV6-int";
chr1 mm10_rmsk exon 9436683 9437312 2842.000000 + . gene_id "RLTR1D2_MM"; transcript_id "RLTR1D2_MM";
chr1 mm10_rmsk exon 11533508 11534907 30539.000000 - . gene_id "L1_Mus3"; transcript_id "L1_Mus3";
chr1 mm10_rmsk exon 18872425 18877658 35311.000000 + . gene_id "L1Md_T"; transcript_id "L1Md_T_dup3";
chr1 mm10_rmsk exon 20971461 20971616 765.000000 - . gene_id "B1_Mur3"; transcript_id "B1_Mur3";

I have run DCC to identify circRNAs in mouse several times, and this is the only time I have hit this problem. If I run this project without the filter, no error is returned. The command without the filter:

DCC -T 10 @samplesheet.txt -mt1 @mate1.txt -mt2 @mate2.txt -m 100000 -n 150 -D -R /workplace/database/mouse/MM10_RepeatMasker_SimpleRepeat.gtf
-an /workplace/database/mouse/Mus_musculus_MM10_forRNAseq3875_20170608.gtf -Pi -A /workplace/database/mouse/BOWTIE2_MM10_Base.fa

Thanks again!

tjakobi commented 4 years ago

Hi @hydraphenix,

thank you for your feedback. Would it be possible to upload both GTF files along with the DCC output files (+ tmp_circCount)? If you like, you can use my upload bin:

https://data.dieterichlab.org/s/circtools_debug_upload

Also, how did you install DCC? 0.4.4 is an older version, 0.4.7 is the most recent one.

Thank you, Tobias

egaffo commented 3 years ago

Dear @tjakobi, I had the same error (though using v0.4.8, with parameters -F -Nr 1 1 -N -D), and I think I have spotted where the issue lies. Depending on the input data, the error may arise from the CombineCounts.map() function, and you should see the culprit line in the Chimeric.out.junction.circRNAmapped file. Below is an example that raised the error in my data.

According to the code populating the mapto dict that composes the keys by concatenating chr, start and end, these two lines (putative back splices) will be assigned the same key MN908947.350327859:

MN908947.3    503    27859    .    0    -
MN908947.3    5032    7859    .    2    +

>>>mapto['MN908947.350327859'] MN908947.350327859 ['MN908947.3\t503\t27859\t.\t0\t-', 'MN908947.3\t5032\t7859\t.\t2\t+']

After composing the run_mapto dict, such an entry will be printed ('\t'.join(run_mapto[key]) + '\n') in the Chimeric.out.junction.circRNAmapped as:

MN908947.3      503     27859   .       0       -       MN908947.3      5032    7859    .       2       +       4       1

instead of two lines:

MN908947.3      503     27859   .       0       -      4
MN908947.3      5032    7859    .       2       +      1

Note that the keys would be different in the stranded mode because of the strand sign in the key (MN908947.350327859- and MN908947.350327859+). Nonetheless, the error could arise if the entries had the same strand.
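The collision can be reproduced outside DCC with a few lines of Python. This is a minimal sketch that only mimics the key construction described above (plain concatenation of chr, start, end and, in stranded mode, the strand sign). The helper name concat_key is hypothetical and is not DCC's actual code.

```python
def concat_key(chrom, start, end, strand=""):
    # Fields are joined with no separator, as in the keys discussed above.
    return chrom + str(start) + str(end) + strand


junctions = [
    ("MN908947.3", 503, 27859, "-"),
    ("MN908947.3", 5032, 7859, "+"),
]

# Unstranded mode: both junctions collapse onto a single key.
unstranded = {concat_key(c, s, e) for c, s, e, _ in junctions}

# Stranded mode: the strand sign happens to keep these two apart ...
stranded = {concat_key(c, s, e, st) for c, s, e, st in junctions}

# ... but two junctions on the same strand would still collide.
same_strand = {concat_key(c, s, e, "+") for c, s, e, _ in junctions}

print(len(unstranded), len(stranded), len(same_strand))
```

With the two example junctions, the unstranded set and the same-strand set each contain one key, while the stranded set contains two.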

The error line @hydraphenix got

chrM 91 11375 chrM chrM chrM chrM chrM chrM chrM chrM chrM chrM chrM chrM

indicates that in each of the 12 samples, there was an ambiguous key, perhaps given by

chrM    91    11375
chrM    911    1375

Putting a separator character between the chr, start and end (and strand) when composing the keys could resolve the issue.
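As a rough illustration of the separator idea, the sketch below builds keys with a tab between the fields. It is not the patch that went into DCC. The function name make_key and the choice of a tab as separator are assumptions made for this example; a tab is convenient because it cannot occur inside a field of a tab-separated file.

```python
def make_key(chrom, start, end, strand=None, sep="\t"):
    """Build an unambiguous lookup key from junction coordinates."""
    parts = [chrom, str(start), str(end)]
    if strand is not None:
        parts.append(strand)  # only used in stranded mode
    return sep.join(parts)


# Plain concatenation: the two chrM candidates share one key.
old_a = "chrM" + "91" + "11375"
old_b = "chrM" + "911" + "1375"
print(old_a == old_b)

# With a separator the keys stay distinct.
new_a = make_key("chrM", 91, 11375)
new_b = make_key("chrM", 911, 1375)
print(new_a == new_b)
```

Using a tuple such as (chrom, start, end, strand) as the dict key would avoid the ambiguity in the same way, without having to pick a separator character.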

I hope it helps,

Enrico

tjakobi commented 3 years ago

Thank you for the patch @egaffo, I'll integrate your patch in the master tree.