egaffo / circompara2

Improved bioinformatic pipeline to identify and quantify circRNA expression from RNA-seq data by combining multiple circRNA detection methods
Other
7 stars 0 forks source link

Is circompara2 allowing for STAR-generated bam file as input? #13

Open lovebaboon1989 opened 1 year ago

lovebaboon1989 commented 1 year ago

Hi there, I'd like to try using circompara2 to map circRNA reads in human genome for my RNAseq project. But since my project has over 1000 samples and we've already done the alignment work for mRNA by STAR aligner (which took lots of time), so I just wonder whether circompara2 may allow for the STAR-generated BAM file as a input? Because if we rerun the genome alignment pipeline using pair-end fastq files once again, that could cost extra a lot of time.. Thanks a lot and looking forward to your kind reply! Best regards, Weiqian Jiang

egaffo commented 1 year ago

You can use the BYPASS option; set it in the vars.py configuration file. You may also want to set the LINMAPS parameter. More in detail, when you already have linear mappings, first, you need to get the (linearly) unmapped reads. Extract them from the BAM using samtools if you did not save them as FASTQ files. Then, you have two choices with CirComPara2: 1) compute only the circRNA expression abundance, or 2) use your BAMs to compute also the data required for the circular-to-linear proportion calculation.

In case 1), you just put the unmapped reads as input in the meta.csv file and set BYPASS = 'linear,linmap' in the vars.py file. With such a setting, CirComPara2 will skip the linear transcripts quantification and linear read alignment steps, performing the circRNA detection and quantification solely. I assume you are not interested in using CirComPara2 also to analyse traditional gene expression; otherwise, set BYPASS = 'linmap' (let’s call it case 1a).

In case 2), you need to set both BYPASS and LINMAPS parameters as follows: BYPASS = 'linear,linmap' (or just 'linmap' in case 1a); LINMAPS must list the BAM’s path for each sample (mind that you also need the .bai BAM index files in the same location of each BAM) and the format is as a Python dictionary. For instance, once, I split my analysis into two runs, one for the “linear” and one for the “circular” parts. For the ‘linear’ analysis, I set BYPASS = ‘circular’ (that's all). Then, in the subsequent run, I analysed the ‘circular’ fraction and set: a) meta.csv with the unmapped read FASTQ files obtained in the first run (i.e.

sample,file
CRR026003,/reads/Mmul_linmap/samples/CRR026003/processings/unmapped_reads/unmapped_1.fastq.gz
CRR026003,/reads/Mmul_linmap/samples/CRR026003/processings/unmapped_reads/unmapped_2.fastq.gz

Note that Mmul_linmap is the directory of the first run of CirComPara2);

b) BYPASS = 'linear,linmap' in the vars.py file, and

c) LINMAPS = "{'CRR026003':'/reads/Mmul_linmap/samples/CRR026003/processings/hisat2_out/CRR026003_hisat2.bam'}” also in the vars.py file. Your LINMAPS parameter will have 1000 entries.

If your reads are stranded, and you want circRNAs with strand info, you need to set the HISAT2_EXTRA_PARAMS (f.i. HISAT2_EXTRA_PARAMS = '--rna-strandness RF') as this parameter controls all strandness-related stuff in CirComPara2 even when linmaps are skipped (I should rename this parameter).

Moreover, if your unmapped reads were not available as paired-end, you might want to also set CIRC_MAPPING = 'SE'in the vars.py file because all your in put reads are considered as single-ended.

Remember you can perform a “dryrun” with CirComPara2's -n option (in the command line) to check that the files can be read by CirComPara2 before actually running the analysis.

However, I've never tried using BAMs generated with anything other than HISAT2.

Hope it helps.

lovebaboon1989 commented 1 year ago

Hi Egaffo, thanks a lot for the elaborate answer, I just followed your advice in case1 (skip all linear steps and only detect and quantify circRNA) and found that we did have unmapped pair-end reads generated by STAR in fastq.gz format as follows, and I also 1. updated BYPASS = 'linear,linmap' in var.py; 2. updated the unmapped reads files in meta.csv. (I guess I don't need to specify bam files or .bai BAM index files in case1 ?)

unmapped read 1 (.fastq.gz) looks like this: @A00453:297:H3GJJDSX3:1:1101:16550:1282 00 GCAGTTTTTCTCCTAGAAATCATTTGAGGGTATTTGCTTTAAGTTGATTGGAAAAATATGGCATAACTGTTTGCACAAACTTGGGACAAATGATATTGGG + FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF @A00453:297:H3GJJDSX3:1:1101:7862:1297 00 GGCAGAAGGAGTGAGATCCAGACTATCCCAGGAAGGAACTGGGGCCTCTGGAAAACCAGAGAGGTTTTCCTCCCCTGAAGGGGTTTTGTGGCACCGTCCCT + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFF

var.py looks like this: META = 'meta.csv' GENOME_FASTA = '../annotation/Homo_sapiens.GRCh38.dna.primary_assembly.fa' ANNOTATION = '../annotation/Homo_sapiens.GRCh38.109.gtf' CPUS = '4' BYPASS = 'linear,linmap'

However, I just got the following error: AttributeError: 'collections.defaultdict' object has no attribute 'abspath': File "/circompara2/src/sconstructs/main.py", line 519: exports = '''env_circpipe get_matching_nodes''') File "/circompara2/tools/scons/scons-local-3.1.2/SCons/Script/SConscript.py", line 660: return method(*args, *kw) File "/circompara2/tools/scons/scons-local-3.1.2/SCons/Script/SConscript.py", line 597: return _SConscript(self.fs, files, **subst_kw) File "/circompara2/tools/scons/scons-local-3.1.2/SCons/Script/SConscript.py", line 286: exec(compile(scriptdata, scriptname, 'exec'), call_stack[-1].globals) File "/circompara2/src/sconstructs/circpipe.py", line 180: 'BAM_INDEX': env['LINMAPS'].abspath + '.bai'}

Do you know how to solve this issue? Thanks a lot!

egaffo commented 1 year ago

You are right; there's no need to specify the .bai files; their paths are inferred from the BAM file path. The reads look ok; they are in standard FASTQ format.

Perhaps you made some typos in writing the vars.py file. Please, check: a) you used double quotation marks to wrap the curly braces and single quotation marks inside the curly braces for the LINMAPS parameter. b) Also, use absolute paths, not relative paths, to specify the BAM locations. c) sample names you used in LINMAPS must match sample names in meta.csv (check carefully for typos). Are you trying with one sample (for testing before analysing all your 1000 samples)? Can you try using two or more samples (for testing)?

lovebaboon1989 commented 1 year ago

Hi Egaffo, thanks for the quick response. I further added the LINMAPS parameter to specify bam files in absolute path (also tried relative path) for each sample in var.py, and double checked with all the sample names as well as paths in meta.csv, but still got error, looking forward to your advice! My var.py looks like this (ignoring all commented lines):

META = 'meta.csv' GENOME_FASTA = '../annotation/Homo_sapiens.GRCh38.dna.primary_assembly.fa' ANNOTATION = '../annotation/Homo_sapiens.GRCh38.109.gtf' CPUS = '4' BYPASS = "linear,linmap" LINMAPS = "{'RS-03774720_122880_RS-03668270_S2':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774720_122880_RS-03668270_S2/RS-03774720_122880_RS-03668270_S2_star/RS-03774720_122880_RS-03668270_S2.bam', 'RS-03774719_525681_RS-03668269_S1':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774719_525681_RS-03668269_S1/RS-03774719_525681_RS-03668269_S1_star/RS-03774719_525681_RS-03668269_S1.bam', 'RS-03774721_520065_RS-03668271_S3':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774721_520065_RS-03668271_S3/RS-03774721_520065_RS-03668271_S3_star/RS-03774721_520065_RS-03668271_S3.bam', 'RS-03774722_538118_RS-03668272_S4':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774722_538118_RS-03668272_S4/RS-03774722_538118_RS-03668272_S4_star/RS-03774722_538118_RS-03668272_S4.bam', 'RS-03774723_525378_RS-03668273_S5':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774723_525378_RS-03668273_S5/RS-03774723_525378_RS-03668273_S5_star/RS-03774723_525378_RS-03668273_S5.bam'}”

GENOME_INDEX = "../indexes/hisat2/"

SEGEMEHL_INDEX = "../indexes/segemehl/Homo_sapiens.GRCh38.dna.primary_assembly.idx"

BWA_INDEX = "../indexes/bwa/"

BOWTIE2_INDEX = "../indexes/bowtie2/"

BOWTIE_INDEX = "../indexes/bowtie/"

STAR_INDEX = "../indexes/star/Homo_sapiens.GRCh38.dna.primary_assembly/"

And the error information is as follows:

File "", line 6

LINMAPS = "{'RS-03774720_122880_RS-03668270_S2':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774720_122880_RS-03668270_S2/RS-03774720_122880_RS-03668270_S2_star/RS-03774720_122880_RS-03668270_S2.bam', 'RS-03774719_525681_RS-03668269_S1':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774719_525681_RS-03668269_S1/RS-03774719_525681_RS-03668269_S1_star/RS-03774719_525681_RS-03668269_S1.bam', 'RS-03774721_520065_RS-03668271_S3':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774721_520065_RS-03668271_S3/RS-03774721_520065_RS-03668271_S3_star/RS-03774721_520065_RS-03668271_S3.bam', 'RS-03774722_538118_RS-03668272_S4':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774722_538118_RS-03668272_S4/RS-03774722_538118_RS-03668272_S4_star/RS-03774722_538118_RS-03668272_S4.bam', 'RS-03774723_525378_RS-03668273_S5':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774723_525378_RS-03668273_S5/RS-03774723_525378_RS-03668273_S5_star/RS-03774723_525378_RS-03668273_S5.bam'}”

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ^

SyntaxError: EOL while scanning string literal

egaffo commented 1 year ago

In the LINMAPS string, the closing double quotes ” does not look like an ASCII character ". Replace it with a simple ". The error suggests the string was not correctly ended, probably because of that character. Avoid non-ASCII characters in the configuration files.

lovebaboon1989 commented 1 year ago

Hi egaffo, thanks. Now I corrected the double quotes but got an error again when using unmapped2anchors.py, do you know what might be the reason for this: unmapped2anchors.py -Q <( zcat samples/RS-03774719_525681_RS-03668269_S1/processings/circRNAs/RS-03774719_525681_RS-03668269_S1.unmappedSE.fq.gz ) | bowtie2 --seed 123 -p 4 --reorder --score-min=C,-15,0 -q -x /home/dbs/indexes/indexes/bowtie2/Homo_sapiens.GRCh38.dna.primary_assembly -U - 2> samples/RS-03774719_525681_RS-03668269_S1/processings/circRNAs/find_circ_out/bt2_secondpass.log | find_circ.py -G /annotation/Homo_sapiens.GRCh38.dna.primary_assembly.fa -p RS-03774719_525681_RS-03668269S1 -s samples/RS-03774719_525681_RS-03668269_S1/processings/circRNAs/find_circ_out/find_circ.log -R samples/RS-03774719_525681_RS-03668269_S1/processings/circRNAs/find_circ_out/sites.reads > samples/RS-03774719_525681_RS-03668269_S1/processings/circRNAs/find_circ_out/sites.bed scons: building terminated because of errors.

Building a SMALL index ERROR:root:Unhandled exception raised while processing alignment pair 0, on input starting at SAM line 0, FASTQ line 0 Traceback (most recent call last): File "/circompara2/src/utils/bash/../../../bin/find_circ.py", line 669, in read = A.qname.split('__')[1] IndexError: list index out of range Traceback (most recent call last): File "/circompara2/src/utils/bash/../../../bin/unmapped2anchors.py", line 229, in handle_read(read) File "/circompara2/src/utils/bash/../../../bin/unmapped2anchors.py", line 173, in handle_read print "@%s_A__%s" % (read.qname,seq) IOError: [Errno 32] Broken pipe

gzip: stdout: Broken pipe scons: *** [samples/RS-03774719_525681_RS-03668269_S1/processings/circRNAs/find_circ_out/sites.bed] Error 1

egaffo commented 1 year ago

Did you run out of disk space? Check also tmp directory...or if you have a shared filesystem on a network it might have gono down. The last IO error tells you there was some problem with reading/writing files

lovebaboon1989 commented 1 year ago

Nope, there is remaining 30 TB on my server so it should not be the reason of limited space. I guess is it because I didn't input the correct bam files? Because I have sample1.bam, sample1.nsorted.bam, sample1.nsorted.primary.bam, sample1.transcriptome.bam, and I just used sample1.bam since I think it should be the sorted bam file.

lovebaboon1989 commented 1 year ago

image This is the what the bam file I used looks like.

egaffo commented 1 year ago

The BAM you are using should not be involved with this error. Instead, did you use pre-computed genome indexes? I guess not, at least for BWA, Bowtie 1 and 2, since you left the parameters commented in the vars.py. It could be that the Bowtie2 index didn't finish being computed, but find_circ started the analysis. So, find_circ used a corrupted index. Can you try generating the indexes before running the analysis and setting them as parameters? Also, check the file samples/RS-03774719_525681_RS-03668269_S1/processings/circRNAs/RS-03774719_525681_RS-03668269_S1.unmappedSE.fq.gz is not empty or bad formatted (it should be just the concatenation of the input read files).

lovebaboon1989 commented 1 year ago

Yes, I just used pre-computed genome indexes from star and SEGEMEHL (generated by circompara2 when I use all default settings to run 2 samples), because for all the remaining index files: bwa, bowtie, bowtie2, hisat2, if I allocate those index files to the pre-computed folders in var.py, there was always error as follows so I only use star and segemehl index files: image image

And I also checked the unmapped.fq.gz file and it seems to look good: image

Maybe if we could solve the bowtie2 index problem, maybe this issue can be solved?

egaffo commented 1 year ago

You missed one part of the index string. You must add the path with the prefix of the index files for BWA, HISAT2 and the Bowties. E.g: GENOME_INDEX = "../indexes/hisat2/Homo_sapiens.GRCh38.dna.primary_assembly" SEGEMEHL_INDEX = "../indexes/segemehl/Homo_sapiens.GRCh38.dna.primary_assembly.idx" BWA_INDEX = "../indexes/bwa/Homo_sapiens.GRCh38.dna.primary_assembly" BOWTIE2_INDEX = "../indexes/bowtie2/Homo_sapiens.GRCh38.dna.primary_assembly" BOWTIE_INDEX = "../indexes/bowtie/Homo_sapiens.GRCh38.dna.primary_assembly" STAR_INDEX = "../indexes/star/Homo_sapiens.GRCh38.dna.primary_assembly"

lovebaboon1989 commented 1 year ago

Hi Egaffo, all the index files setting is working now, but I still have the following similar errors when using find_circ_out:

ERROR:root:Unhandled exception raised while processing alignment pair 0, on input starting at SAM line 0, FASTQ line 0 Traceback (most recent call last): File "/circompara2/src/utils/bash/../../../bin/find_circ.py", line 669, in read = A.qname.split('__')[1] IndexError: list index out of range Traceback (most recent call last): File "/circompara2/src/utils/bash/../../../bin/unmapped2anchors.py", line 229, in handle_read(read) File "/circompara2/src/utils/bash/../../../bin/unmapped2anchors.py", line 173, in handle_read print "@%s_A__%s" % (read.qname,seq) IOError: [Errno 32] Broken pipe

gzip: stdout: Broken pipe scons: *** [samples/RS-03774719_525681_RS-03668269_S1/processings/circRNAs/find_circ_out/sites.bed] Error 1

Again for your information, here is my var.py: META = 'meta.csv' GENOME_FASTA = '../annotation/Homo_sapiens.GRCh38.dna.primary_assembly.fa' ANNOTATION = '../annotation/Homo_sapiens.GRCh38.109.gtf' CPUS = '4' BYPASS = "linear,linmap" LINMAPS = "{'RS-03774720_122880_RS-03668270_S2':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774720_122880_RS-03668270_S2/RS-03774720_122880_RS-03668270_S2_star/RS-03774720_122880_RS-03668270_S2.bam', 'RS-03774719_525681_RS-03668269_S1':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774719_525681_RS-03668269_S1/RS-03774719_525681_RS-03668269_S1_star/RS-03774719_525681_RS-03668269_S1.bam', 'RS-03774721_520065_RS-03668271_S3':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774721_520065_RS-03668271_S3/RS-03774721_520065_RS-03668271_S3_star/RS-03774721_520065_RS-03668271_S3.bam', 'RS-03774722_538118_RS-03668272_S4':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774722_538118_RS-03668272_S4/RS-03774722_538118_RS-03668272_S4_star/RS-03774722_538118_RS-03668272_S4.bam', 'RS-03774723_525378_RS-03668273_S5':'/data/klengelresslerR01/batch1/bcbio_rnaseq/R01_bcbio_batch1_S1-S96/work/align/RS-03774723_525378_RS-03668273_S5/RS-03774723_525378_RS-03668273_S5_star/RS-03774723_525378_RS-03668273_S5.bam'}"

GENOME_INDEX = "../indexes/hisat2/Homo_sapiens.GRCh38.dna.primary_assembly" SEGEMEHL_INDEX = "../indexes/segemehl/Homo_sapiens.GRCh38.dna.primary_assembly.idx" BWA_INDEX = "../indexes/bwa/Homo_sapiens.GRCh38.dna.primary_assembly" BOWTIE2_INDEX = "../indexes/bowtie2/Homo_sapiens.GRCh38.dna.primary_assembly" BOWTIE_INDEX = "../indexes/bowtie/Homo_sapiens.GRCh38.dna.primary_assembly" STAR_INDEX = "../indexes/star/Homo_sapiens.GRCh38.dna.primary_assembly/"

Here is my unmapped reads1.fastq.gz looks like: image

And here is what my bam file looks like: image