RNA-seq and DNA-seq data were mapped to different reference genome

YiweiNiu commented 2 years ago

Hi,

I am using JACUSA2 to detect RNA editing events with paired RNA-seq and DNA-seq data.

However, DNA-seq and RNA-seq data were mapped to different reference genomes. The reference genomes were both hg38 with the same names for major chromosomes, but the reference genome for DNA-seq has some alternative scaffolds and decoy sequences.

Here is the reference genome for DNA-seq: GRCh38_full_analysis_set_plus_decoy_hla.fa, and the reference genome for RNA-seq: Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC92.fasta.

First, I encountered the error: "Sequence Dictionaries of BAM files do not match"

INFO    00:00:00  Computing overlap between sequence records.
htsjdk.samtools.SAMException: Sequence Dictionaries of BAM files do not match /home2/niuyw/project/MEI2/gatkout/HG00315/HG00315.final.bam and /home2/niuyw/Data_processing/STAR/Geuvadis/HG00315.sortedByCoord.q255.rmDup.bam
    at lib.util.AbstractMethod.getSAMSequenceRecords(AbstractMethod.java:247)
    at lib.util.AbstractMethod.initCoordinateProvider(AbstractMethod.java:189)
    at lib.cli.CLI.processArgs(CLI.java:308)
    at lib.util.AbstractTool.run(AbstractTool.java:48)
    at jacusa.JACUSA.main(JACUSA.java:96)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:61)
--------------------------------------------------------------------------------
JACUSA2 Version: 2.0.2-RC call-2 -p 4 -a B,D,H:condition=1,I,S,Y -s -f B -q 35 -r /home2/niuyw/Data_processing/JACUSA2/Geuvadis/HG00315/HG00315.txt /home2/niuyw/project/MEI2/gatkout/HG00315/HG00315.final.bam /home2/niuyw/Data_processing/STAR/Geuvadis/HG00315.sortedByCoord.q255.rmDup.bam
--------------------------------------------------------------------------------
java.lang.NullPointerException
    at lib.worker.WorkerDispatcher.hasNext(WorkerDispatcher.java:60)
    at lib.worker.WorkerDispatcher.run(WorkerDispatcher.java:64)
    at lib.util.AbstractTool.run(AbstractTool.java:60)
    at jacusa.JACUSA.main(JACUSA.java:96)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:61)

Then I replaced the RNA-seq BAM header using the one from the DNA-seq BAM, but the output was empty. Here is log info:

Ignoring SAM validation error: ERROR: Record 1, Read name ERR188023.16833774, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 2, Read name ERR188023.9783723, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 3, Read name ERR188023.20088878, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.16833774, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 4, Read name ERR188023.22658761, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.9783723, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 5, Read name ERR188023.9065204, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.20088878, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 6, Read name ERR188023.20059829, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.22658761, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 7, Read name ERR188023.16299520, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.9065204, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 8, Read name ERR188023.20056934, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.20059829, RG ID on SAMRecord not found in header: rg1
Ignoring SAM validation error: ERROR: Record 9, Read name ERR188023.14789090, RG ID on SAMRecord not found in header: rg1
ERROR   00:00:47  ERROR: Read name ERR188023.16299520, RG ID on SAMRecord not found in header: rg1
...

The RNA-seq BAM I used was HG00315.star.GRCh38.Aligned.sortedByCoord.markduplicates.bam, and the DNA-seq BAM I used was HG00315.final.cram (I used samtools to convert CRAM to BAM).

I wonder if there is any workaround for this situation. Thanks a lot for your help.

Bests, Yiwei

piechottam commented 2 years ago

Hi, I know this problem very well - unfortunately there is no way around comparing the sequence directories of BAM files.

Like you suggested, I would replace the sequence header AND remove scaffolds that are NOT part of the new sequence directory.

It seems that by doing so, you replaced the entire header INCLUDING Read Group IDs. You will have to merge the existing Read Group IDs and new sequence header manually...

$ samtools view -H <DNA-BAM> | grep "^@RG" > read_ids_header.sam
$ samtools view -H <RNA-BAM> | grep "^@SQ" > seq_ids_header.sam
...

YiweiNiu commented 2 years ago

Hi, thank you for your quick reply.

I did as you said, and JACUSA2 now ran successfully without any errors.

However, the output was empty. I wonder whether it is because I modified the header of the RNA-seq bam file, but I failed to find any error messages in the log file.

Here is the command I used.

input_DNA=/home2/niuyw/project/MEI2/gatkout/${sample}/${sample}.final.bam
input_RNA=$WORKDIR/STAR/$dataset/${sample}.sortedByCoord.q255.rmDup.bam
out_dir=$WORKDIR/JACUSA2/$dataset/$sample

if [ ! -d $out_dir ]; then mkdir -p $out_dir; fi

# reheader input RNA
$path2samtools reheader $WORKDIR/JACUSA2/$dataset/header_new.sam $input_RNA > $out_dir/tmp.bam
$path2samtools index -@ $PPN $out_dir/tmp.bam

# JACUSA2
$path2java -jar $path2JACUSA2 call-2 -p $PPN -a B,D,H:condition=1,I,S,Y -s -f B -q 35 -r $out_dir/${sample}.txt -F1 1024 $input_DNA $out_dir/tmp.bam

gzip $out_dir/${sample}.txt.filtered

Both HG00315.txt and HG00315.txt.filtered.gz were empty.

I also tried to call RNA editing sites using RNA-seq data alone (call-1 mode), and the output seems normal.

Any clues on this? Thanks again for your help.

piechottam commented 2 years ago

Check Chrs. Please check if all your BAMs files are referring to the same chromosomes, e.g.: 1, 2, 3, ... etc. OR chr1, chr2, ... etc.
Check DNA sample Run call-1 on DNA sample - are there any SNPs called? If not, that is strange!
Try less stringent Run: $ $path2java -jar $path2JACUSA2 call-2 -p $PPN -a B,D -r $out_dir/${sample}.txt $input_DNA $out_dir/tmp.bam

YiweiNiu commented 2 years ago

Hi, thanks a lot for your reply.

Please check if all your BAMs files are referring to the same chromosomes, e.g.:

Yes, both RNA-seq BAM and DNA-seq BAM used the "chr1, chr2..." names. As I mentioned above, they just differ for some alternative scaffolds and decoy sequences.

Run call-1 on DNA sample - are there any SNPs called? If not, that is strange!

I tried this with the command line:

# input
input=/home2/niuyw/project/MEI2/gatkout/${sample}/${sample}.final.bam
out_dir=$WORKDIR/JACUSA2/$dataset/$sample/dna

if [ ! -d $out_dir ]; then mkdir -p $out_dir; fi
$path2java -jar $path2JACUSA2 call-1 -p $PPN -a B,D,I,S,Y -s -f B -q 35 -r $out_dir/${sample}.txt $input

The output was empty, both for the HG00315.txt and HG00315.txt.filtered. That is strange.

Try less stringent

I separated the feature-filtered results in another file using the -s parameter. Since both the output HG00315.txt and feature-filtered file HG00315.txt.filtered were empty, I don't think it was because of stringent filtering.

piechottam commented 2 years ago

Please run: $path2java -jar $path2JACUSA2 pileup -p $PPN -a B,D,I,S,Y -q 35 -r $out_dir/${sample}.txt $input_DNA This should generate a "samtools like pileup" - it the result file is empty, something is wrong with your DNA.bam.

What mapper did you use for mapping DNA reads? JACUSA2 filter reads with mapping mapping quality < 20. "You might need to lower the mapping quality filter "-m1".

YiweiNiu commented 2 years ago

Hi, this data was from the 1KGP project, and they mapped the reads using BWA-MEM.

How do you suggest setting the m1 parameter?

piechottam commented 2 years ago

Do you see reads if you load them up in the IGV?

dieterich-lab / JACUSA2

RNA-seq and DNA-seq data were mapped to different reference genome #56