dieterich-lab / JACUSA2

New version of JACUSA -> 2.0
GNU General Public License v3.0
23 stars 3 forks source link

Usage of the parameter -B #52

Closed 5Tony closed 2 years ago

5Tony commented 2 years ago

Hi,I am a beginning bioinformatics student,I took the liberty of coming to ask you a question,I don't understand the use of the -B parameter.

cmd help:

-B Tag reads by base substitution. Count non-reference base substitution per read and stratify. Requires stranded library type. (Format for T to C mismatch: T2C; use ',' to separate substitutions) Default: none

Here is the code I used:

java -jar JACUSA_v2.0.2-RC.jar call-1 \ -R /home/data/reference/GRCh38.primary_assembly.genome.fa \ -a I,S,B,M,Y \ -B A2G \ -p 10 \ -q 10 \ -P UNSTRANDED \ -r /home/data/t050326/result/test1/test12.txt \ -s \ -w 7500 \ -W 15000 \ /home/data/result/MD/SRR6112594Aligned_sorted2.out.MD.bam

But there is a bug in the operation:

INFO 00:00:01 Thread 5: Working on contig chr1:1005522-1020521 INFO 00:00:02 Thread 3: Working on contig chr1:1064443-1079442 INFO 00:00:02 Thread 4: Working on contig chr1:1098054-1113053 INFO 00:00:02 Thread 4: Working on contig chr1:1228872-1243871 INFO 00:00:02 Thread 4: Working on contig chr1:1315582-1330581 ERROR 00:00:02 Problem with read: SRR6112594.981532 in /home/data/result/MD/SRR6112594Aligned_sorted2.out.MD.bam htsjdk.samtools.SAMFormatException: Multiple SAMRecord with read name SRR6112594.981532 for first end. at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryMate(SamReader.java:451) at lib.record.Record.getMate(Record.java:60) at lib.data.storage.readsubstitution.BaseSubRecordProcessor.processPE(BaseSubRecordProcessor.java:179) at lib.data.storage.readsubstitution.BaseSubRecordProcessor.process(BaseSubRecordProcessor.java:238) at lib.data.storage.container.UnstrandedCacheContainter.process(UnstrandedCacheContainter.java:50) at lib.data.assembler.SiteDataAssembler.buildCache(SiteDataAssembler.java:56) at lib.util.ReplicateContainer.createIterators(ReplicateContainer.java:49) at lib.util.ConditionContainer.lambda$0(ConditionContainer.java:29) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1376) at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) at lib.util.ConditionContainer.updateWindowCoordinates(ConditionContainer.java:29) at lib.worker.AbstractWorker.updateReservedWindowCoordinate(AbstractWorker.java:170) at lib.worker.AbstractWorker.processInit(AbstractWorker.java:186) at lib.worker.AbstractWorker.run(AbstractWorker.java:218)

Here's some of the code and java version I used:

java -version openjdk version "1.8.0_152-release" OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12) OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)

/home/data/software/STAR-2.7.10a/bin/Linux_x86_64/STAR --runThreadN 6 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJDBoverhangMin 1 --alignSJoverhangMin 8 --alignSoftClipAtReferenceEnds Yes --chimJunctionOverhangMin 15 --chimMainSegmentMultNmax 1 --chimOutType Junctions SeparateSAMold WithinBAM SoftClip --chimSegmentMin 15 --genomeDir /home/data/project/genome/genomeDir_hg38v39_r149 --genomeLoad NoSharedMemory --limitSjdbInsertNsj 1200000 --outFileNamePrefix /home/data/result/STAR/SRR6112594 --outFilterIntronMotifs None --outFilterMatchNminOverLread 0.33 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --outFilterMultimapNmax 20 --outFilterScoreMinOverLread 0.33 --outFilterType BySJout --outSAMattributes NH HI AS XS nM NM ch --outSAMstrandField intronMotif --outSAMtype BAM Unsorted --outSAMunmapped Within --quantMode TranscriptomeSAM GeneCounts --twopassMode Basic --readFilesIn /home/data/result/trim_galore/SRR6112594/SRR6112594_1_val_1.fq /home/data/result/trim_galore/SRR6112594/SRR6112594_2_val_2.fq

samtools sort SRR6112594Aligned.out.bam -o SRR6112594Aligned_sorted.out.bam samtools index SRR6112594Aligned_sorted.out.bam samtools calmd SRR6112594Aligned_sorted.out.bam /home/data/reference/GRCh38.primary_assembly.genome.fa > \ SRR6112594Aligned_sorted.out.MD.bam samtools sort SRR6112594Aligned_sorted.out.MD.bam -o SRR6112594Aligned_sorted2.out.MD.bam samtools index SRR6112594Aligned_sorted2.out.MD.bam

I've changed several sets of data and all have similar bugs. I've been searching for a long time and haven't found out how to solve this, so I'm taking the liberty to ask you, and I hope you can solve the doubts in my mind. @Piechotta

piechottam commented 2 years ago

-B A2G

with -B A2G you define a base substitution that will stratify reads into:

See attached Image -> you can call variants on tagged and untagged reads.

base_stratification

Out of curiosity - what are you trying to achieve?

Bug

The exception/error/bug is thrown by htsjdk - it says that: "Multiple SAMRecord with read name SRR6112594.981532"

How do you generate the BAM index: samtools index ? Or do you use something else.

You could manually process the reads and remove:

with samtools view -F <FLAGS> <input.bam> ... Check Flags for details.

5Tony commented 2 years ago

Thank you very much for your reply.

I want to get the result file after adding the parameter -B A2G, but the program often breaks in running, but it may run normally after excluding the parameter -B A2G, which is distressing me. I've read the manual, but I can't find a suitable solution, so I'm taking the liberty to ask you. Can you provide a sample command to add the -B A2G parameter?


Here are the commands I had:

  1. I downloaded GSE104379 from GEO database,

  2. I went through fasterq-dump to fastq, fasterq-dump -e 16 -p --split-files *.sra

  3. Mapped using STAR,

STAR --runThreadN 6 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJDBoverhangMin 1 --alignSJoverhangMin 8 --alignSoftClipAtReferenceEnds Yes --chimJunctionOverhangMin 15 --chimMainSegmentMultNmax 1 --chimOutType Junctions SeparateSAMold WithinBAM SoftClip --chimSegmentMin 15 --genomeDir /home/data/project/genome/genomeDir_hg38v39_r149 --genomeLoad NoSharedMemory --limitSjdbInsertNsj 1200000 --outFileNamePrefix /home/data/result/STAR/SRR6112594 --outFilterIntronMotifs None --outFilterMatchNminOverLread 0.33 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --outFilterMultimapNmax 20 --outFilterScoreMinOverLread 0.33 --outFilterType BySJout --outSAMattributes NH HI AS XS nM NM ch --outSAMstrandField intronMotif --outSAMtype BAM Unsorted --outSAMunmapped Within --quantMode TranscriptomeSAM GeneCounts --twopassMode Basic --readFilesIn /home/data/result/trim_galore/SRR6112594/SRR6112594_1_val_1.fq /home/data/result/trim_galore/SRR6112594/SRR6112594_2_val_2.fq

  1. Then sorted, indexed and calmd using samtools.

samtools sort -l 0 -@ 25 -m 1G SRR6112594Aligned.out.bam -o SRR6112594Aligned_sorted.out.bam

samtools index -b SRR6112594Aligned_sorted.out.bam

samtools calmd -@10 SRR6112594Aligned_sorted.out.bam GRCh38.primary_assembly.genome.fa > \ SRR6112594Aligned_sorted.out.MD.bam

samtools sort -l 0 -@ 25 -m 1G SRR6112594Aligned_sorted.out.MD.bam -o SRR6112594Aligned_sorted_2.out.MD.bam

samtools index -b SRR6112594Aligned_sorted_2.out.MD.bam

  1. Used JACUSA.

java -jar JACUSA_v2.0.2-RC.jar call-1 -R /home/data/reference/GRCh38.primary_assembly.genome.fa -a I,S,B,M,Y -B A2G -p 10 -q 10 -P UNSTRANDED -r /home/data/t050326/result/test1/test12.txt -s -w 7500 -W 15000 /home/data/result/MD/SRR6112594Aligned_sorted2.out.MD.bam

  1. The following results appeared:

INFO 00:00:02 Thread 9: Working on contig chr1:3737194-3752193 INFO 00:00:02 Thread 1: Working on contig chr1:3780675-3795674 INFO 00:00:02 Thread 1: Working on contig chr1:3812106-3827105 INFO 00:00:02 Thread 1: Working on contig chr1:3836232-3851231 INFO 00:00:02 Thread 9: Working on contig chr1:4007396-4022395 INFO 00:00:02 Thread 9: Working on contig chr1:4039783-4054782 ERROR 00:00:02 Problem with read: SRR6112594.981532 in /home/data/t050326/result/MD/SRR6112594Aligned_sorted2.out.MD.bam htsjdk.samtools.SAMFormatException: Multiple SAMRecord with read name SRR6112594.981532 for first end. at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryMate(SamReader.java:451) at lib.record.Record.getMate(Record.java:60) at lib.data.storage.readsubstitution.BaseSubRecordProcessor.processPE(BaseSubRecordProcessor.java:179) at lib.data.storage.readsubstitution.BaseSubRecordProcessor.process(BaseSubRecordProcessor.java:238) at lib.data.storage.container.UnstrandedCacheContainter.process(UnstrandedCacheContainter.java:50) at lib.data.assembler.SiteDataAssembler.buildCache(SiteDataAssembler.java:56) at lib.util.ReplicateContainer.createIterators(ReplicateContainer.java:49) at lib.util.ReplicateContainer.updateIterators(ReplicateContainer.java:55) at lib.util.ConditionContainer.lambda$1(ConditionContainer.java:34) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) at lib.util.ConditionContainer.updateActiveWinCoord(ConditionContainer.java:34) at lib.worker.AbstractWorker.hasNext(AbstractWorker.java:121) at lib.worker.AbstractWorker.processReady(AbstractWorker.java:196) at lib.worker.AbstractWorker.run(AbstractWorker.java:213) INFO 00:00:02 Thread 9: Working on contig chr1:5786107-5801106 INFO 00:00:02 Thread 1: Working on contig chr1:5883045-5898044


I checked the SRR6112594Aligned_sorted2.out.MD.bam using samtools and saw the following, I am again trying to find more ways to solve this problem, thank you. samtools tview -p chr1:1403914 SRR6112594Aligned_sorted2.out.MD.bam /home/data/t050326/reference/GRCh38.primary_assembly.genome.fa

3 28