dieterich-lab / JACUSA2

New version of JACUSA -> 2.0
GNU General Public License v3.0
23 stars 3 forks source link

JACUSA2 strand problem #27

Closed CDieterich closed 4 years ago

CDieterich commented 4 years ago

Hi Michael,

jacusa.jar Version: 1.3.5 call-2 -s -c 5 -P RF-FIRSTSTRAND,RF-FIRSTSTRAND -p 10 -W 1000000 -F 1024 --filterNM_1 5 --filterNM_2 5 -T 1 -a D,M,Y -u DirMult:showAlpha -R -r call2_RNA_KO K002000225_86972_STARmapping_uniq_rmdup.bam,K002000225_86975_STARmapping_uniq_rmdup.bam,K002000225_86976_STARmapping_uniq_rmdup.bam K002000225_86971_STARmapping_uniq_rmdup.bam,K002000225_86974_STARmapping_uniq_rmdup.bam,K002000225_86973_STARmapping_uniq_rmdup.bam

and

JACUSA2 Version: 2.0.0-RC13 call-2 -F 1024 -c 10 -p 10 -D -I -a D,Y -P1 RF-FIRSTSTRAND -P2 RF-FIRSTSTRAND -r call2_RNA_KO_RC9_call2_result.out K002000225_86972_STARmapping_uniq_rmdup.bam,K002000225_86975_STARmapping_uniq_rmdup.bam,K002000225_86976_STARmapping_uniq_rmdup.bam K002000225_86971_STARmapping_uniq_rmdup.bam,K002000225_86974_STARmapping_uniq_rmdup.bam,K002000225_86973_STARmapping_uniq_rmdup.bam

produce different results on the classical ADAR1 E861A allele.

JACUSA2 says: 3 89746630 89746631 call-2 5.625566079828488 - 0,0,11,16 0,0,11,3 0,0,8,16 0,0,25,0 0,0,24,0 0,0,9,5 T 3 89746631 89746632 call-2 5.6063230362406955 - 11,0,0,15 11,0,0,3 8,0,0,16 25,0,0,0 24,0,0,0 9,0,0,5 T

while JACUSA1.3.5 says: 3 89746630 89746631 variant 5.625566079896998 + 16,11,0,0 3,11,0,0 16,8,0,0 0,25,0,0 0,24,0,0 5,9,0,0 alpha1=11188.384184816841:9601.85412112383:69.8105553602469:69.8105553602469;alpha2=0.5647767609649967:7.290456409137258:0.20697830129410996:0.20697830129410996;alphaP=0.9016251286991446:2.6414091415873204:0.16117799024330678:0.16117799024330678;initAlpha1=0.5346369330406189:0.45873559060709357:0.003313738176143709:0.003313738176143709;initAlpha2=0.08164001235762022:0.9116972755004218:0.00333135607097912:0.00333135607097912;initAlphaP=0.31167095769867736:0.6816842228240553:0.00332240973863364:0.00332240973863364;iterations1=23;iterations2=10;iterationsP=10;logLikelihood1=-47.93415535953375;logLikelihood2=-19.00650741266077;logLikelihoodP=-72.56622885209151 A 3 89746631 89746632 variant 5.606323036284408 + 15,0,0,11 3,0,0,11 16,0,0,8 0,0,0,25 0,0,0,24 5,0,0,9 alpha1=11070.40293653336:69.91107380907921:69.91107380907921:9777.979127320734;alpha2=0.5627590466451253:0.20628169884562636:0.20628169884562636:7.266380686311476;alphaP=0.9128990637495525:0.16174569173203854:0.16174569173203854:2.705863471753122;initAlpha1=0.5275266396288331:0.003309608329803837:0.003309608329803837:0.4658541437115592;initAlpha2=0.08162740582152242:0.0033148272976983003:0.0033148272976983003:0.9117429395830811;initAlphaP=0.3063304358925807:0.00331219729116136:0.00331219729116136:0.6870451695250964;iterations1=23;iterations2=10;iterationsP=10;logLikelihood1=-47.25731272570556;logLikelihood2=-18.986209684475106;logLikelihoodP=-71.84984544646508 A

I think JACUSA 1.3.5 is right !

eboileau commented 4 years ago

I think this issue is related to the following observations.

Steps to reproduce the issue

JACUSA2 call-1 using identical parameters on the same BAM file, once with -P FR-SECONDSTRAND, the other with -P RF-FIRSTSTRAND, yields identical results.

## JACUSA2 Version: 2.0.0-RC13 call-1 -p 12 -P FR-SECONDSTRAND -m 1 -q 20 -c 1 -a D,Y -r FR_107339B_SLAM_1h.tab /prj/MCF7_RNA_turnover_Naarmann/CD30-analysis/mapping/107339B_SLAM_1h.sortdedup.bam
#contig start   end     name    score   strand  bases11 info    filter  ref
1       10559   10560   call-1  1.345028283936324       +       0,0,1,0 *       *       C
1       10622   10623   call-1  2.714893275683302       +       0,2,0,0 *       *       T
1       10648   10649   call-1  1.345028283941954       +       1,0,0,0 *       *       G
1       14114   14115   call-1  0.680230635087355       -       1,0,0,4 *       *       T
1       14247   14248   call-1  2.8159531447923136      -       1,3,0,0 *       *       A
1       14353   14354   call-1  2.7148932756517734      -       0,0,0,2 *       *       G
## JACUSA2 Version: 2.0.0-RC13 call-1 -p 12 -P RF-FIRSTSTRAND -m 1 -q 20 -c 1 -a D,Y -r _107339B_SLAM_1h.tab /prj/MCF7_RNA_turnover_Naarmann/CD30-analysis/mapping/107339B_SLAM_1h.sortdedup.bam
#contig start   end     name    score   strand  bases11 info    filter  ref
1       10559   10560   call-1  1.345028283936324       +       0,0,1,0 *       *       C
1       10622   10623   call-1  2.714893275683302       +       0,2,0,0 *       *       T
1       10648   10649   call-1  1.345028283941954       +       1,0,0,0 *       *       G
1       14114   14115   call-1  0.680230635087355       -       1,0,0,4 *       *       T
1       14247   14248   call-1  2.8159531447923136      -       1,3,0,0 *       *       A
1       14353   14354   call-1  2.7148932756517734      -       0,0,0,2 *       *       G

Steps to reproduce the issue

Same behaviour with JACUSA2 pileup

## JACUSA2 Version: 2.0.0-RC13 pileup -p 12 -m1 1 -q1 20 -c1 1 -P1 RF-FIRSTSTRAND -m2 1 -q2 20 -c2 1 -P2 FR-SECONDSTRAND -a D,Y -r pileup_107339B_SLAM_1h.tab /prj/MCF7_RNA_turnover_Naarmann/CD30-analysis/mapping/107339B_SLAM_1h.sortdedup.bam /prj/MCF7_RNA_turnover_Naarmann/CD30-analysis/mapping/107339B_SLAM_1h.sortdedup.bam
#contig start   end name    stat    strand  bases11 bases21 info    filter  ref
1   10540   10541   pileup  2.0 +   0,1,0,0 0,1,0,0 *   *   C
1   10541   10542   pileup  2.0 +   0,1,0,0 0,1,0,0 *   *   C
1   10542   10543   pileup  2.0 +   0,0,1,0 0,0,1,0 *   *   G
1   10543   10544   pileup  2.0 +   1,0,0,0 1,0,0,0 *   *   A
1   10544   10545   pileup  2.0 +   1,0,0,0 1,0,0,0 *   *   A
1   10545   10546   pileup  2.0 +   1,0,0,0 1,0,0,0 *   *   A
1   10546   10547   pileup  2.0 +   0,0,0,1 0,0,0,1 *   *   T
1   10547   10548   pileup  2.0 +   0,1,0,0 0,1,0,0 *   *   C
1   10548   10549   pileup  2.0 +   0,0,0,1 0,0,0,1 *   *   T
1   10549   10550   pileup  2.0 +   0,0,1,0 0,0,1,0 *   *   G

Version

Additional info

I add one selected examples illustrating what would be the expected behaviour.

My library is RF-FIRSTSTRAND, so I have /2---> <---1/, or mate 2 is from the original strand. For one particular entry, JACUSA2 yields

1       10622   10623   call-1  2.714893275683302       +       0,2,0,0 *       *       T

The matching BAM entries (discarding secondary alignments, I think this is what JACUSA2 does, right?) are:

[mate1] A00620:41:HKMKVDMXX:2:1333:23782:9330   99      1       10541   1       89M2D12M        =       10570   130     CCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGG       FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   MD:Z:19C62T6^GC12  PG:Z:MarkDuplicates      NH:i:4  HI:i:1  jI:B:i,-1       NM:i:4  jM:B:c,-1       nM:i:4  AS:i:178

[mate2] A00620:41:HKMKVDMXX:2:1333:23782:9330   147     1       10570   1       60M2D39M        =       10541   -130    CCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGACGCAGAGAGGCGCGCCGCGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:     MD:Z:53T6^GC17G21       PG:Z:MarkDuplicates NH:i:4  HI:i:1  jI:B:i,-1       NM:i:4  jM:B:c,-1       nM:i:4  AS:i:178

Here, mate1 is mapped in sense strand, and mate2 is mapped anti-sense, i.e. mate2 originates from the reverse strand. Both are properly mapped in pairs.

I find these mismatches: [mate1] pos=10622, ref=T, base=C (would be reported as A->G) [mate2] pos=10622, ref=A, base=G

This one fragment (2 mates) overlap the "T->C mismatch" reported by JACUSA2.

Incidentally:

Additional files

I have added to the cloud

  1. 107339B_SLAM_1h.sortdedup.bam
  2. output from call-1 and pileup above
eboileau commented 4 years ago

Dear Michael, RC14 seems to be working as expected, so unless @CDieterich sees any other related issue, I'll close this one. Thanks.