dieterich-lab / DCC

DCC uses output from the STAR read mapper to systematically detect back-splice junctions in next-generation sequencing data. DCC applies a series of filters and integrates data across replicate sets to arrive at a precise list of circRNA candidates.
https://dieterichlab.org/software/
GNU General Public License v3.0
36 stars 20 forks source link

Issue running DCC #58

Closed cmonger closed 5 years ago

cmonger commented 5 years ago

Dear Dieterich-lab,

I'm having some issues running DCC (version 0.4.7) which I hope you can help me with.

I'm trying to run the software in a virtual environment on python version 2.7.15rc1

The only packages installed in the environment are FUCHs, DCC, and the relavent dependancies.

When running DCC I get this output/error:

> Temporary folder _tmp_DCC/ already exists, reusing
> DCC 0.4.7 started
> Traceback (most recent call last):
>   File "/home/craig/circRNA/output/python-virtual-environments/env/bin/DCC", line 11, in <module>
>     load_entry_point('DCC==0.4.7', 'console_scripts', 'DCC')()
>   File "build/bdist.linux-x86_64/egg/DCC/main.py", line 146, in main
>   File "build/bdist.linux-x86_64/egg/DCC/main.py", line 529, in remove_empty_lines
> TypeError: 'NoneType' object is not iterable

I get the same error if I use the suggested parameters (and replicate samples and mates supplied in samplesheet/mate1/mate2 files) or even when running the most 'simple' command possible e.g:


DCC total-rna-1/Chimeric.out.junction

or

DCC @sampleFile.txt -mt1 @mate1.txt -mt2 @mate2.txt -D -R repeatsSort.gtf -an merged.gtf -Pi -F -M -Nr 2 2 -fg -G -A genome.fa -T 70 

Please let me know if there is any information I can provide to help fix this issue.

Thanks, Craig

tjakobi commented 5 years ago

Dear @cmonger,

thank you for reporting this error. I recently added a fix to address empty lines in input files and it seems this error may be an unintended side effect of that fix. I should be able to provide a fix shortly.

Thank you, Tobias

tjakobi commented 5 years ago

Dear @cmonger,

I fixed the issue in the latest push, please update your DCC installation.

You may, however additionally supply the BAM files via the -B parameter as -B @bam_list or similar. The issue you reported didn't come up for me earlier because I always specify the BAM files and thus the list of files was not empty.

Cheers, Tobias

cmonger commented 5 years ago

@tjakobi Thanks for the quick fix. I'll give it a try now. Should I provide the BAM files even if I intend to use the results as input to FUCHS? Will doing so improve the accuracy/results of DCC?
I assume I should only provide the bam files of the paired-end aligned reads rather than the individual mates?

tjakobi commented 5 years ago

Dear @cmonger,

the BAM files are used in any case by DCC, the parameter is just used to directly supply the names - otherwise DCC tries to guess the names from the supplied Chimeric.out.junction file names. If provided via the -B command, the list should only contain the "main" mapping BAMs, not the additional mate mappings.

Cheers, Tobias

udube commented 5 years ago

Hello,

I am having the same error.

Here is the command I use:

DCC @DCC_InputFiles/samplesheet -mt1 @DCC_InputFiles/read -T 20 -D -N -R GRCh38_Repeats_simpleRepeats_RepeatMasker.gtf -an gencode.v26.primary_assembly.annotation.gtf -F -M -Nr 1 1 -fg -G -A GRCh38.primary_assembly.genome.fa -B @DCC_InputFiles/bam_files

Here is the error I receive:

DCC 0.4.7 started
Traceback (most recent call last):
  File "/home/dubeu/.local/bin/DCC", line 9, in <module>
    load_entry_point('DCC==0.4.7', 'console_scripts', 'DCC')()
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 145, in main
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 528, in remove_empty_lines
TypeError: 'NoneType' object is not iterable

Thanks

tjakobi commented 5 years ago

Dear @udube,

thank you for reporting the issues.

It seems like you only provided one mate via -mt1 but not the second mate via -mt2. This triggers the error you are seeing.

I'll include a appropriate warning message in the next commit.

Cheers, Tobias

tjakobi commented 5 years ago

Dear @udube,

I addressed your issue in commit c5ae23d which introduces a more consistent check of command line arguments before check the file contents. Additional error messages are also shown now.

Cheers, Tobias

knuppd commented 5 years ago

Hi,

Like in a previous issue, I cannot get DCC to run past the first few steps with an error produced (same as #58 ). I just downloaded DCC yesterday so i'm sure this is the most current version. I am following the tutorial steps with my own data. I have created the samplesheet, mate1, and mate2 files. Here is the error:

Bash script contents: `#!/usr/bin/env bash

GENOME=/archive/miuralab/dknupp/Index/STAR/mm10-star

DCC @samplesheet \ -mt1 @mate1 \ -mt2 @mate2 \ -D \ -R Ensemble_UCSC_Repeats.gtf \ -an Ensemble_UCSC_GenePrediction.gtf \ -Pi \ -F \ -M \ -Nr 2 5 \ -fg \ -G \ -A $GENOME/Genome`

I have checked, and samplesheet, mate1, and mate2 do not have any empty lines. So i'm not sure where this error is being produced.

Also for the genome.fa file, I built the index using STAR (which I used in the first steps for the alignment). These files don't have .fa or .fai after them, will this be a problem?

Thanks for the help,

Dave

Incase this is helpful, below are contents of input files

samplesheet file content: ~/path/to/WT_Rep1Chimeric.out.junction ~/path/to/WT_Rep2Chimeric.out.junction ~/path/to/WT_Rep3Chimeric.out.junction ~/path/to/KO_Rep1Chimeric.out.junction ~/path/to/KO_Rep2Chimeric.out.junction ~/path/to/KO_Rep3Chimeric.out.junction

mate1 file content: ~/path/to/mate1/WT_Rep1Chimeric.out.junction ~/path/to/mate1/WT_Rep2Chimeric.out.junction ~/path/to/mate1/WT_Rep3Chimeric.out.junction ~/path/to/mate1/KO_Rep1Chimeric.out.junction ~/path/to/mate1/KO_Rep2Chimeric.out.junction ~/path/to/mate1/KO_Rep3Chimeric.out.junction

mate2 file content: ~/path/to/mate2/WT_Rep1Chimeric.out.junction ~/path/to/mate2/WT_Rep2Chimeric.out.junction ~/path/to/mate2/WT_Rep3Chimeric.out.junction ~/path/to/mate2/KO_Rep1Chimeric.out.junction ~/path/to/mate2/KO_Rep2Chimeric.out.junction ~/path/to/mate2/KO_Rep3Chimeric.out.junction

contents of repeats .gtf (head -n3): 1 mm10_simpleRepeat exon 3000098 3000123 52.000000 . . gene_id "trf"; transcript_id "trf"; 1 mm10_simpleRepeat exon 3003165 3003639 564.000000 . . gene_id "trf"; transcript_id "trf_dup1"; 1 mm10_simpleRepeat exon 3003486 3003546 81.000000 . . gene_id "trf"; transcript_id "trf_dup2";

contents of genepredictions .gtf (head -n3): 1 mm10_knownGene exon 3205904 3207317 0.000000 - . gene_id "uc007aet.1"; transcript_id "uc007aet.1"; 1 mm10_knownGene exon 3213439 3215632 0.000000 - . gene_id "uc007aet.1"; transcript_id "uc007aet.1"; 1 mm10_knownGene stop_codon 3216022 3216024 0.000000 - . gene_id "uc007aeu.1"; transcript_id "uc007aeu.1";

@tjakobi

tjakobi commented 5 years ago

Dear @knuppd,

thank you for your feedback. Just to get the versions straight, did you do a git clone of the master branch or download the latest release from the release tab?

Cheers, Tobias

knuppd commented 5 years ago

Ho Tobias,

Thanks for the quick reply, I downloaded the latest release, should i use git clone?

Best,

Dave


From: Tobias Jakobi notifications@github.com Sent: Monday, March 25, 2019 4:51 AM To: dieterich-lab/DCC Cc: knuppdav; Mention Subject: Re: [dieterich-lab/DCC] Issue running DCC (#58)

Dear @knuppdhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_knuppd&d=DwMCaQ&c=nE__W8dFE-shTxStwXtp0A&r=J5MVSzpPIK1DszxXiQP6eg&m=6SoX-zTr_vShrWUw2KM19i4oEVF88xfOW4r6kuQ8j_c&s=5-Isj8ChZP8TLedNBy_PKbxTTEqrEx8eYT6SCRRPo8s&e=,

thank you for your feedback. Just to get the versions straight, did you do a git clone of the master branch or download the latest release from the release tab?

Cheers, Tobias

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dieterich-2Dlab_DCC_issues_58-23issuecomment-2D476163684&d=DwMCaQ&c=nE__W8dFE-shTxStwXtp0A&r=J5MVSzpPIK1DszxXiQP6eg&m=6SoX-zTr_vShrWUw2KM19i4oEVF88xfOW4r6kuQ8j_c&s=-6yV1tCxEQCMLenzK4yC9nCPhjczlOHrmlyFP6HUETU&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_Aun7qdDXDc-2DD-5FueI12PlwqyYX1uQrG24ks5vaLhIgaJpZM4YLj3L&d=DwMCaQ&c=nE__W8dFE-shTxStwXtp0A&r=J5MVSzpPIK1DszxXiQP6eg&m=6SoX-zTr_vShrWUw2KM19i4oEVF88xfOW4r6kuQ8j_c&s=LQVpLjf6iwbbb5v1HGGMx0jfdpSlmXKcyXGd9PeWDuM&e=.

tjakobi commented 5 years ago

Hi @knuppd,

I did not compile a new release of DCC yet, therefore the latest release does not yet include the fix from this thread. If you do the git clone you will get the version with the fixed code.

Cheers, Tobias

knuppd commented 5 years ago

Hi @tjakobi,

Thank you I was able to get the program to run after downloading using git. However, I did not get two of the output files: LinearCount and CircSkipJunctions. It seems as though one of the bam files maybe caused an error but i'm not sure if this is the cause of the problem. (I have counts though, for this bam file assigned to the circRNA output file (below)).

CircRNACount File:


Chr    Start      End        WT_Rep1Chimeric.out.junction  WT_Rep2Chimeric.out.junction  WT_Rep3Chimeric.out.junction  KO_Rep1Chimeric.out.junction  KO_Rep2Chimeric.out.junction  KO_Rep3Chimeric.out.junction
chr1   7120194    7120615    14                            12                            10                            8                                  4                                  7
chr1   7120194    7120615    16                            8                             15                            6                                  5                                  2
chr1   16389821   16394210   3                             2                             1                             3                                  3                                  3
chr1   16389821   16394210   2                             2                             3                             2                                  2                                  0
chr1   16440323   16463194   2                             2                             2                             2                                  3                                  2
chr1   16440323   16486150   3                             2                             3                             4                                  4                                  0
chr1   16440323   16509408   22                            17                            15                            17                                 18                                 12
chr1   16440323   16509408   15                            12                            5                             23                                 11                                 16
chr1   16463035   16509408   6                             7                             3                             1                                  4                                  3```

-----UCSC mm10 annotation file-----

[dknupp@ponderosa circRNA]$ cat Ensemble_UCSC_GenePrediction.gtf | head
1   mm10_knownGene  exon    3205904 3207317 0.000000    -   .   gene_id "uc007aet.1"; transcript_id "uc007aet.1"; 
1   mm10_knownGene  exon    3213439 3215632 0.000000    -   .   gene_id "uc007aet.1"; transcript_id "uc007aet.1"; 
1   mm10_knownGene  stop_codon  3216022 3216024 0.000000    -   .   gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
1   mm10_knownGene  CDS 3216025 3216968 0.000000    -   2   gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
1   mm10_knownGene  exon    3214482 3216968 0.000000    -   .   gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
1   mm10_knownGene  CDS 3421702 3421901 0.000000    -   1   gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
1   mm10_knownGene  exon    3421702 3421901 0.000000    -   .   gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
1   mm10_knownGene  CDS 3670552 3671348 0.000000    -   0   gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
1   mm10_knownGene  start_codon 3671346 3671348 0.000000    -   .   gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 
1   mm10_knownGene  exon    3670552 3671498 0.000000    -   .   gene_id "uc007aeu.1"; transcript_id "uc007aeu.1"; 

------DCC LOG FILE------

2019-03-25 08:48:36,305 DCC 0.4.7 started
2019-03-25 08:48:36,305 DCC command line: /home/dknupp/.local/bin/DCC @samplesheet -mt1 @mate1 -mt2 @mate2 -D -R Ensemble_UCSC_Repeats.gtf -an Ensemble_UCSC_GenePrediction.gtf -Pi -F -M -Nr 2 5 -fg -G -A /archive/miuralab/dknupp/Index/STAR/mm10-star/Genome
2019-03-25 08:48:36,312 Starting to detect circRNAs
2019-03-25 08:48:36,312 Stranded data mode
2019-03-25 08:48:36,312 Please make sure that the read pairs have been mapped both, combined and on a per mate basis
2019-03-25 08:48:36,312 Collecting chimera information from mates-separate mapping
2019-03-25 08:49:33,750 started circRNA detection from file _tmp_DCC/WT_Rep1Chimeric.out.junction.0G58O0
2019-03-25 08:49:33,750 started circRNA detection from file _tmp_DCC/WT_Rep2Chimeric.out.junction.UH1FWS
2019-03-25 08:53:27,732 finished circRNA detection from file _tmp_DCC/WT_Rep2Chimeric.out.junction.UH1FWS
2019-03-25 08:53:27,733 started circRNA detection from file _tmp_DCC/WT_Rep3Chimeric.out.junction.8855XC
2019-03-25 08:54:50,136 Read 82383662.+.82327728.HWI-D00269:94:C5MB2ANXX:8:2205:13950:16471 has more than 2 count.
2019-03-25 08:54:50,141 Read 82383662.+.82327728.HWI-D00269:94:C5MB2ANXX:8:2205:13950:16471 has more than 2 count.
2019-03-25 08:55:03,304 finished circRNA detection from file _tmp_DCC/WT_Rep1Chimeric.out.junction.0G58O0
2019-03-25 08:55:03,306 started circRNA detection from file _tmp_DCC/KO_Rep1Chimeric.out.junction.H9HNUW
2019-03-25 08:57:24,027 Read 150510039.+.150500002.HWI-D00269:94:C5MB2ANXX:6:2204:6389:73208 has more than 2 count.
2019-03-25 08:57:24,029 Read 150510039.+.150500002.HWI-D00269:94:C5MB2ANXX:6:2204:6389:73208 has more than 2 count.
2019-03-25 08:57:47,572 Read 150510039.+.150500002.HWI-D00269:94:C5MB2ANXX:6:2204:6389:73208 has more than 2 count.
2019-03-25 08:59:03,852 Read 82383662.+.82327728.HWI-D00269:94:C5MB2ANXX:8:2205:13950:16471 has more than 2 count.
2019-03-25 08:59:14,463 finished circRNA detection from file _tmp_DCC/KO_Rep1Chimeric.out.junction.H9HNUW
2019-03-25 08:59:14,464 started circRNA detection from file _tmp_DCC/KO_Rep2Chimeric.out.junction.C9TIQ6
2019-03-25 09:00:18,736 finished circRNA detection from file _tmp_DCC/WT_Rep3Chimeric.out.junction.8855XC
2019-03-25 09:00:18,737 started circRNA detection from file _tmp_DCC/KO_Rep3Chimeric.out.junction.PYK756
2019-03-25 09:04:20,738 finished circRNA detection from file _tmp_DCC/KO_Rep2Chimeric.out.junction.C9TIQ6
2019-03-25 09:06:45,822 finished circRNA detection from file _tmp_DCC/KO_Rep3Chimeric.out.junction.PYK756
2019-03-25 09:06:45,822 Combining individual circRNA read counts
2019-03-25 09:08:05,319 Write in annotation
2019-03-25 09:08:05,320 Select gene features in Annotation file
2019-03-25 09:09:24,622 Filtering started
2019-03-25 09:09:24,623 Using files _tmp_DCC/tmp_circCount and _tmp_DCC/tmp_coordinates for filtering
2019-03-25 09:09:34,078 Filtering by read counts
2019-03-25 09:09:35,541 Filter by non repetitive region
2019-03-25 09:18:46,319 Deleting circRNA candidates from mitochondrial chromosome
2019-03-25 09:18:46,432 Filtering by gene annotation. CircRNA candidates from more than one genes are deleted
2019-03-25 09:18:46,527 Filtering finished
2019-03-25 09:18:46,527 No BAM files provided (-B) trying to automatically guess BAM file names
2019-03-25 09:18:46,596 BAM file /archive/miuralab/dknupp/Results/DCC/M_musculus_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam has no index (/archive/miuralab/dknupp/Results/DCC/M_musculus_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam.bai is missing)
2019-03-25 09:18:46,597 The following BAM files seem to be not sorted by coordinate or are missing an index:
2019-03-25 09:18:46,597 /archive/miuralab/dknupp/Results/DCC/M_musculus_2KO_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam
tjakobi commented 5 years ago

Hi @knuppd,

thanks for reporting back. Are the BAM files sorted and have the .bai index file in the same folder? DCC assumes that that is the case.

Cheers, Tobias

knuppd commented 5 years ago

Hi @tjakobi ,

They say that they are ".sortedByCoord.out.bam" But i have not run samtools index. Was I supposed to do this prior to running DCC?

Here is the contents of the folder in which both mate1 and mate2 were mapped to the genome.

image

Best,

Dave

tjakobi commented 5 years ago

Dear @knuppd,

STAR already sorted the BAM files correctly but they are missing the index. You can create the index via samtools index file.bam.

To speed things up you may run the command in a loop so you don't have to start it for every file yourself: for i in *.bam; do samtools index $i; done

Cheers, Tobias

knuppd commented 5 years ago

Hi @tjakobi,

Thanks for the help. I am still not getting the ouput, maybe something is wrong with the python i have from miniconda? It seems to be having problems. Maybe I should try a different version? current version i'm using is 2.7.15.

BAM file for KORep3 looks like the other files so i'm not sure what this header mistake is or if it is related to the earlier flag.

Here is my log file:


nohup: ignoring input
Output folder ./ already exists, reusing
Temporary folder _tmp_DCC/ already exists, reusing
DCC 0.4.7 started
32 CPU cores available, using 2
[E::fai_build_core] Format error, unexpected character at line 1
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/dknupp/miniconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/dknupp/miniconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/dknupp/miniconda2/lib/python2.7/multiprocessing/pool.py", line 392, in _handle_results
    task = get()
TypeError: __init__() takes exactly 2 arguments (1 given)

[E::fai_build_core] Format error, unexpected character at line 1
[E::fai_build_core] Format error, unexpected character at line 1
[E::fai_build_core] Format error, unexpected character at line 1
started circRNA detection from file _tmp_DCC/WT_Rep1Chimeric.out.junction.TDSRDG
    => separating duplicates [_tmp_DCC/WT_Rep1Chimeric.out.junction.TDSRDG]
    => locating small circRNAs [_tmp_DCC/WT_Rep1Chimeric.out.junction.TDSRDG]
    => locating circRNAs (stranded mode) [_tmp_DCC/WT_Rep1Chimeric.out.junction.TDSRDG]
    => merging circRNAs [_tmp_DCC/WT_Rep1Chimeric.out.junction.TDSRDG]
    => sorting circRNAs (stranded mode) [_tmp_DCC/WT_Rep1Chimeric.out.junction.TDSRDG]
finished circRNA detection from file _tmp_DCC/WT_Rep1Chimeric.out.junction.TDSRDG
started circRNA detection from file _tmp_DCC/KO_Rep1Chimeric.out.junction.4D2T6Y
    => separating duplicates [_tmp_DCC/KO_Rep1Chimeric.out.junction.4D2T6Y]
Read 150510039.+.150500002.HWI-D00269:94:C5MB2ANXX:6:2204:6389:73208 has more than 2 count.
Read 150510039.+.150500002.HWI-D00269:94:C5MB2ANXX:6:2204:6389:73208 has more than 2 count.
Read 150510039.+.150500002.HWI-D00269:94:C5MB2ANXX:6:2204:6389:73208 has more than 2 count.
    => locating small circRNAs [_tmp_DCC/KO_Rep1Chimeric.out.junction.4D2T6Y]
    => locating circRNAs (stranded mode) [_tmp_DCC/KO_Rep1Chimeric.out.junction.4D2T6Y]
    => merging circRNAs [_tmp_DCC/KO_Rep1Chimeric.out.junction.4D2T6Y]
    => sorting circRNAs (stranded mode) [_tmp_DCC/KO_Rep1Chimeric.out.junction.4D2T6Y]
finished circRNA detection from file _tmp_DCC/KO_Rep1Chimeric.out.junction.4D2T6Y
started circRNA detection from file _tmp_DCC/KO_Rep2Chimeric.out.junction.BDWNKL
    => separating duplicates [_tmp_DCC/KO_Rep2Chimeric.out.junction.BDWNKL]
    => locating small circRNAs [_tmp_DCC/KO_Rep2Chimeric.out.junction.BDWNKL]
    => locating circRNAs (stranded mode) [_tmp_DCC/Nova2KO_Rep2Chimeric.out.junction.BDWNKL]
    => merging circRNAs [_tmp_DCC/KO_Rep2Chimeric.out.junction.BDWNKL]
    => sorting circRNAs (stranded mode) [_tmp_DCC/KO_Rep2Chimeric.out.junction.BDWNKL]
finished circRNA detection from file _tmp_DCC/KO_Rep2Chimeric.out.junction.BDWNKL
Counting host gene expression based on detected and filtered circRNA coordinates for /archive/miuralab/dknupp/Results/DCC/M_musculus_Cortex_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam
Started linear gene expression counting for /archive/miuralab/dknupp/Results/DCC/M_musculus_Cortex_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam
    => running mpileup for start positions [/archive/miuralab/dknupp/Results/DCC/M_musculus_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam]
Counting host gene expression based on detected and filtered circRNA coordinates for /archive/miuralab/dknupp/Results/DCC/M_musculus_Cortex_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam
Started linear gene expression counting for /archive/miuralab/dknupp/Results/DCC/M_musculus_Cortex_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam
    => running mpileup for start positions [/archive/miuralab/dknupp/Results/DCC/M_musculus_Cortex_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam]
Counting host gene expression based on detected and filtered circRNA coordinates for /archive/miuralab/dknupp/Results/DCC/M_musculus_Cortex_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam
Started linear gene expression counting for /archive/miuralab/dknupp/Results/DCC/M_musculus_Cortex_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam
    => running mpileup for start positions [/archive/miuralab/dknupp/Results/DCC/M_musculus_Cortex_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam]
Counting host gene expression based on detected and filtered circRNA coordinates for /archive/miuralab/dknupp/Results/DCC/M_musculus_Cortex_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam
Started linear gene expression counting for /archive/miuralab/dknupp/Results/DCC/M_musculus_Cortex_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out.bam
    => running mpileup for start positions [/archive/miuralab/dknupp/Results/DCC/M_musculus_Cortex_20190323/mRNA/KO_Rep3Aligned.sortedByCoord.out[E::fai_build_core] Format error, unexpected character at line 1
[E::fai_build_core] Format error, unexpected character at line 1```
tjakobi commented 5 years ago

Dear @knuppd,

it seems this is more an issue with a FASTA file index than with the BAM files. In the directory containing the genome FASTA file, could you please regenerate the index?

samtools faidx your_fasta_file.fa

Cheers, Tobias

knuppd commented 5 years ago

Dear @tjakobi

Thank you very much for your help getting DCC to run. I really appreciate it.

I have now successfully gotten the program to run with the anticipated output files. I guess my last question is if its normal to have repeats of circRNA cooridinates?

In my output file of circRNACount there are repeats of circRNAs (same coordinates) but they have different count values?

Chr Start   End WT_Rep1Chimeric.out.junction    WT_Rep2Chimeric.out.junction    WT_Rep3Chimeric.out.junction    KO_Rep1Chimeric.out.junction    KO_Rep2Chimeric.out.junction    KO_Rep3Chimeric.out.junction
chr1    7120194 7120615 14  12  10  8   4   7
chr1    7120194 7120615 16  8   15  6   5   2
chr1    16389821    16394210    3   2   1   3   3   3
chr1    16389821    16394210    2   2   3   2   2   0
chr1    16440323    16463194    2   2   2   2   3   2
chr1    16440323    16486150    3   2   3   4   4   0
chr1    16440323    16509408    22  17  15  17  18  12
chr1    16440323    16509408    15  12  5   23  11  16
chr1    16463035    16509408    6   7   3   1   4   3

Best,

Dave

tjakobi commented 5 years ago

Dear @knuppd,

that indeed looks strange. Did you look in the CircCorrdinates file to take a look at the strand? It may be the case the same circRNA was detected on both strands and/or the strandness flag for DCC might not have been set correctly.

Cheers, Tobias

knuppd commented 5 years ago

Hi @tjakobi

Checking the CircCoordinates file, yes it does seem that DCC is detecting a circRNA on both strands. Should the -ss option be used for stranded libraries? I thought the default was "stranded". Does -ss have to be set regardless?

Best, Dave

tjakobi commented 5 years ago

Hi @knuppd,

generally, do you see the majority of the circRNAs annotated as "not annotated"? That would be an indicator that -ss should be used. By default, DCC assumes a stranded setup (non-stranded would be -N), the -ss flag can be understood as direction, i.e. first or second-strand, comparable to TopHat.

Cheers Tobias

knuppd commented 5 years ago

Hi @tjakobi,

Yes, I generally see that one of the two coordinates (the one with the proper strand) will have an annotated GeneID and the other will not. I have illumina, dUTP libraries. So should I be using -ss fr-firststrand as the option?

tjakobi commented 5 years ago

Hi @knuppd,

something I should definitely put in the documentation. I also wrote it at some point as part of the protocol paper for the circtools workflow:

DCC supports different sequencing protocols widely used for whole-transcriptome experiments. Generally, the user has the option to select between the stranded and unstranded input data (by using the -N flag, stranded mode is the default). That is, for a stranded library DCC is able to distinguish if a circRNA is located on the sense or antisense strand. If the sequencing library was pre- pared in unstranded manner, DCC assigns the strand of the circular based on the host gene’s strand since the aligned reads do not pro- vide strand-specific information. Additionally, DCC allows the user to select the library preparation type based on the technique employed—either “firststrand” or “secondstrand” (by using the -ss flag). Which option is the correct one depends on the employed sequencing protocol.

The following list gives an overview of com- mon sequencing kits and the respective parameter choice: First-strand kits (default):

● All dUTP methods, NSR, NNSR ● TruSeq Stranded Total RNA Sample Prep Kit ● TruSeq Stranded mRNA Sample Prep Kit ● NEB Ultra Directional RNA Library Prep Kit ● Agilent SureSelect Strand-Specific

Second-strand kits (second-strand parameter -ss has to be used):

● Directional Illumina (Ligation), Standard SOLiD ● ScriptSeq v2 RNA-Seq Library Preparation Kit ● SMARTer Stranded Total RNA ● Encore Complete RNA-Seq Library Systems

Source: Deep Computational Circular RNA Analytics from RNA-seq Data

From this list the correct choice would be to not use -ss nor -N.

knuppd commented 5 years ago

Hi @tjakobi,

Thank you for the info. Doesn't it seem odd though that I have a large proportion of my detected circRNAs having different strands (i.e. strands not of the parent mRNA)? From my CircCoordinates output file, I have 3258 detected circRNAs. If I remove those that have the same chromosome, start and stop position I am left with 2209 circRNAs (i.e. if there are 2 circRNAs with the same chr/start/stop, only 1 will be kept and the other removed). That means 33% of the detected circRNAs are on the opposite strand as the mRNA is coded.

knuppd commented 5 years ago

options used while running DCC:


GENOME=/archive/mlab/dknupp/Index/STAR
GTF=/archive/mlab/dknupp/Index/STAR

DCC @samplesheet \
        -mt1 @mate1 \
        -mt2 @mate2 \
        -D \
        -R Ensemble_UCSC_Repeats.gtf \
        -an $GTF/gencode.vM17.primary_assembly.annotation.gtf \
        -Pi \
        -F \
        -M \
        -Nr 1 4 \
        -fg \
        -G \
        -B @bamfiles \
        -A $GENOME/GRCm38.primary_assembly.genome.fa```
tjakobi commented 5 years ago

Dear @knuppd,

indeed this behavior is note expected. Would you be able to share one or more of the BAM files (+the chimeric junction files) for further debugging?

I can provide necessary upload capacity if required.

Cheers Tobias

knuppd commented 5 years ago

Hi @tjakobi,

Yes I can, do you have an email I can contact you with? I could upload them via dropbox to you.

Best, Dave

tjakobi commented 5 years ago

Hi @knuppd,

just use circtools@dieterichlab.org.

Thank you! Tobias

knuppd commented 5 years ago

Dear @tjakobi,

I have shared these files with your email above via dropbox.

Best,

Dave

tjakobi commented 5 years ago

Dear @knuppd,

I am downloading the files now. I will report back as soon as I can pinpoint the issue.

Cheers, Tobias

tjakobi commented 5 years ago

Hi @knuppd,

for your reference, I was able to reproduce the issue and try go investigate what is happening there.

Cheers, Tobias

tjakobi commented 5 years ago

Dear @knuppd,

I looked through the BAM file, especially the header and saw that you used STAR 2.6.0b. STAR changed the chimeric output format with version 2.6.0+, therefore requiring to additionally specify

--chimOutType Junctions SeparateSAMold. (see also the main documentation of circtools)

I could not find the parameter in the CLI call for STAR, maybe you could rerun the mapping with the additional flag and try running DCC on the newly generated junction files?

Cheers, Tobias

knuppd commented 5 years ago

Hi @tjakobi,

I have tried re-running the alignments with this option. However the same problem persists.

For reference, here is the alignment command:

#! /usr/bin/env bash

set -exo pipefail

#STAR alignment 

SAMPLES="WT_Rep1 WT_Rep2 WT_Rep3 KO_Rep1 KO_Rep2 KO_Rep3"
GENOME=/archive/mlab/dknupp/Index/STAR/mm10-star
READPATH=/archive/mlab/dknupp/Data/M_musculus_Cortex_20190221

#STEP1: Map paired end data using both mates (_R1 and _R2)
#<<STAR
for SAMPLE in $SAMPLES; do
        STAR --runThreadN 10 \
                --genomeDir $GENOME \
                --outSAMtype BAM SortedByCoordinate \
                --readFilesIn $READPATH/${SAMPLE}_1.fastq.gz $READPATH/${SAMPLE}_2.fastq.gz \
                --readFilesCommand zcat \
                --outFileNamePrefix $SAMPLE \
                --outReadsUnmapped Fastx \
                --outSJfilterOverhangMin 15 15 15 15 \
                --alignSJoverhangMin 15 \
                --alignSJDBoverhangMin 15 \
                --outFilterMultimapNmax 20 \
                --outFilterScoreMin 1 \
                --outFilterMatchNmin 1 \
                --outFilterMismatchNmax 2 \
                --chimSegmentMin 15 \
                --chimScoreMin 15 \
                --chimScoreSeparation 10 \
                --chimJunctionOverhangMin 15 \
                --chimOutType Junctions SeparateSAMold
done
STAR

Best,

Dave

tjakobi commented 5 years ago

Thank you for your effort @knuppd. Would it be possible to compare the junction files and - if they are different from the previous run - upload them?

Cheers, Tobias

knuppd commented 5 years ago

Hi @tjakobi,

I looked at the chimeric junction files and after sorting them numerically on the first 2 columns and comparing the first 20 lines of each they appear identical. I also check the individual mate1 and mate2 chimeric junction files as well.

Best,

Dave

tjakobi commented 5 years ago

Hi @knuppd,

thank you for looking into this. I looked into the BAM file with rseqc and it seems the the library is not strand specific?

infer_experiment.py -i WT_Rep1Aligned.sortedByCoord.out.bam -r out.bed 
Reading reference gene model out.bed ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled

This is PairEnd Data
Fraction of reads failed to determine: 0.0369
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4803
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4828

In case of a non-stranded library, DCC cannot know from which strand a read originates and will probably end up putting 50% of the reads on one strand each. In this case DCC also cannot determine if the circRNA is novel or just belongs to the annotated gene on the other strand.

Since I hardly see unstranded libraries anymore the code base for this workflow may not work perfectly - DCC is supposed to assign circRNAs in this case based on known annotations. I will have to look into this. Could you run DCC again with -N and see if the double entries are gone?

Cheers, Tobias

knuppd commented 5 years ago

Hi @tjakobi,

Thank you, I will try running the data with the -N option. The data I am analyzing is from a 2019 paper that specified in the methods that the library prep they used was Illumina ribo-zero tru-seq protocol, which is why I assumed this was stranded data. Since this data appears to be unstranded (maybe they used an old kit?) can I still use DCC/CircTest for differential expression of circRNAs?

Best,

Dave

tjakobi commented 5 years ago

Dear @knuppd,

Yes, you may still use DCC for analyzing the data set. Please let me know how results look for the -N options.

Cheers, Tobias

knuppd commented 5 years ago

Hi @tjakobi,

I have finished running DCC with the -N flag. It appears to have fixed the problem, I am not detecting any duplicate lines in the file. I appreciate the help!

Best,

Dave

tjakobi commented 5 years ago

Hi @knuppd,

okay, perfect. I will close this issue here, please do not hesitate to open a new issue if you face any problems.

Cheers, Tobias