Magdoll / cDNA_Cupcake

Miscellaneous collection of Python and R scripts for processing Iso-Seq data
BSD 3-Clause Clear License
257 stars 102 forks

Some questions about call_variant in isoPhase #148

Open CrazyHsu opened 3 years ago

CrazyHsu commented 3 years ago

Hi Liz,

I'm using IsoPhase to assign haplotypes to isoforms, but I have some questions after reading your source code. Here https://github.com/Magdoll/cDNA_Cupcake/blob/master/phasing/io/MPileUpVariantCaller.py#L135, you use a Fisher exact test to calculate a p-value, but why do you then inflate it with pval *= self.number_of_tests before testing for significance?

In my case with smaller read coverage (such as 11 reads: 6 for the reference allele, 5 for the SNP), even when the Fisher exact p-value is significant (less than 0.05), the position is still filtered out because of the larger p-value after multiplying by number_of_tests. It seems IsoPhase works better with large read coverage (such as 40) to avoid this, but how can I handle the case with smaller read coverage? Thanks.

Best, Feng Xu

Magdoll commented 3 years ago

Hi @CrazyHsu ,

The pval *= self.number_of_tests is adjusting for multiple testing. It is not related to the read coverage but to the number of Fisher exact tests being done.

-Liz

CrazyHsu commented 3 years ago

Hi @Magdoll ,

Yes, in my case there are 11 reads mapped spanning 7 candidate alleles (so number_of_tests is 7), with the reference allele covered by 6 reads and the alternative allele by 5 reads. In this case, the expression stats.fisher_exact([[count, r.clean_cov-count], [exp, r.clean_cov-exp]], alternative='greater') for a single allele becomes stats.fisher_exact([[5, 11-5], [0.005*11, 11-0.005*11]], alternative='greater'), and the p-value is 0.0227, which looks significant (<0.05). But after the multiple-testing adjustment pval *= self.number_of_tests, the p-value becomes 0.1589 (0.0227*7), which is not significant, so the position is filtered out.

When it comes to the situation with 41 reads, 21 supporting the reference allele and 20 the alternative allele, the Fisher exact p-value is 5.7329e-08, and after multiplying by number_of_tests it is still less than 0.05, i.e., significant. So, how can I handle the low-coverage situation? Thanks.
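For reference, the two scenarios can be checked with a minimal sketch using scipy.stats.fisher_exact. Note the expected error count exp is rounded down to 0 here so that the contingency table stays integer-valued; the actual IsoPhase code uses a fractional expectation such as 0.005 * clean_cov, which gives slightly different p-values. The number of tests (7) matches the example above.

```python
from scipy import stats

NUMBER_OF_TESTS = 7  # one Fisher exact test per candidate allele, as above

# Low coverage: 11 clean reads, 5 carrying the alternative allele.
_, p_low = stats.fisher_exact([[5, 11 - 5], [0, 11 - 0]], alternative='greater')

# Higher coverage: 41 clean reads, 20 carrying the alternative allele.
_, p_high = stats.fisher_exact([[20, 41 - 20], [0, 41 - 0]], alternative='greater')

print(p_low < 0.05)                       # True: raw p-value is significant
print(p_low * NUMBER_OF_TESTS < 0.05)     # False: fails the Bonferroni-adjusted cutoff
print(p_high * NUMBER_OF_TESTS < 0.05)    # True: high coverage survives the adjustment
```

This reproduces the asymmetry described above: only the high-coverage position survives the Bonferroni-style multiplication.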

Feng Xu

Magdoll commented 3 years ago

Hi @CrazyHsu ,

That is a very good point. Mechanistically (setting aside whether this is statistically a good idea), you can run run_phaser.py with a different p-value cutoff using the --pval_cutoff parameter.

The other option is that I add a parameter to turn off the multiple-testing correction and have users use it at their own risk.

Do you have a preference? -Liz

CrazyHsu commented 3 years ago

Hi @Magdoll ,

I'm not very good at statistics, so I don't quite understand the statistical implications of the multiple-testing correction. Personally, I prefer the latter option, without the correction, but I don't know whether the results would still be statistically sound.

Best Feng Xu

Magdoll commented 3 years ago

@CrazyHsu let me think about this and get back to you. I may implement other ways to correct for multiple testing that are not as strict as the Bonferroni correction.

CrazyHsu commented 3 years ago

@Magdoll Thank you for your prompt reply; I look forward to your new ideas.

Best, Feng Xu

Magdoll commented 3 years ago

Hi @CrazyHsu , please try out the experimental version of IsoPhase that allows the use of the Benjamini-Hochberg procedure instead of the Bonferroni correction, which should increase the sensitivity of SNP identification.

To do this you will need to switch to a different branch of Cupcake:

git clone https://github.com/Magdoll/cDNA_Cupcake.git
cd cDNA_Cupcake
git checkout magphase
python setup.py build
python setup.py install

Now there's a new parameter

$ run_phaser.py --help
usage: run_phaser.py [-h] -o OUTPUT_PREFIX --strand {+,-} [--partial_ok]
                     [-p PVAL_CUTOFF] [--bhFDR BHFDR] [-n PLOIDY]
                     fastx_filename sam_filename mpileup_filename read_stat
                     mapping_filename

positional arguments:
  fastx_filename        Input FLNC fasta or fastq
  sam_filename
  mpileup_filename
  read_stat
  mapping_filename

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
  --strand {+,-}
  --partial_ok
  -p PVAL_CUTOFF, --pval_cutoff PVAL_CUTOFF
  --bhFDR BHFDR         FDR to be used for the Benjamini–Hochberg correction.
                        Default: None (not used).
  -n PLOIDY, --ploidy PLOIDY

I recommend running run_phaser.py with --bhFDR 0.01 as a first try, which sets a 1% FDR. If you want to increase sensitivity (at the risk of more false calls), you can raise that number to, say, 0.02, 0.03, or 0.05.
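As a rough illustration of why Benjamini-Hochberg tends to be more sensitive than Bonferroni, here is a minimal sketch over a hypothetical set of per-position Fisher p-values (this is not the Cupcake implementation, and the p-values are made up):

```python
# Hypothetical sorted p-values from per-position Fisher exact tests.
pvals = sorted([1e-8, 0.004, 0.009, 0.02, 0.3])
n = len(pvals)
alpha = 0.05

# Bonferroni: reject only positions with p * n <= alpha.
bonferroni_calls = [p for p in pvals if p * n <= alpha]

# Benjamini-Hochberg: find the largest rank k with p_(k) <= (k / n) * alpha
# and reject everything up to that rank.
k_max = max((k for k, p in enumerate(pvals, start=1) if p <= (k / n) * alpha),
            default=0)
bh_calls = pvals[:k_max]

print(len(bonferroni_calls))  # 3 positions survive Bonferroni
print(len(bh_calls))          # 4 positions survive BH at the same level
```

Here the same level (0.05) is used for both procedures for comparison; in run_phaser.py the BH FDR is set separately via --bhFDR.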

Please let me know how this goes. -Liz


CrazyHsu commented 3 years ago

Hi @Magdoll , thanks for the new version; I will try it out and give you feedback.

CrazyHsu commented 3 years ago

Hi @Magdoll , I have tried the new version and found some issues. As shown in the IGV screenshot below, you can clearly see at least 8 bases showing allelic imbalance. Here is the ccs.mpileup file I used. [IGV screenshot attached] But after trying the new version, when bhFDR was set to 0.01 it reported no SNP_FOUND; when bhFDR was set to 0.05, only 6 SNPs were found. Neither result is consistent with the observation.

I then tried to figure out the reason for the difference and found that many sequencing errors were picked up by positions_to_call (this can also be clearly observed in the picture above), which inflated the self.number_of_tests value. Accordingly, the bh_val threshold is scaled down, causing many genuine allelic SNPs to be filtered out. In other words, sequencing errors can cause false negatives.

In my next attempt, I made some modifications to MPileUpVariantCaller.py, as follows. At line 166:

odds, pval = stats.fisher_exact([[count, r.clean_cov-count], [exp, r.clean_cov-exp]], alternative='greater')
if pval <= vc.pval_cutoff:      # with this filter, sequencing-error positions are not stored in pval_dict
    pval_dict[(pos, base)] = BHtuple(pval=pval, record=r)

At line 175:

bh_val = ((rank0+1)/len(keys_pos_base)) * vc.bhFDR    # Only significant positions will be used to adjust bh_val

With the modifications above, I recovered all the SNPs observed in IGV with bhFDR set to 0.01.
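The combined effect of the two changes can be sketched as a standalone function (this is a simplification, not the actual MPileUpVariantCaller.py logic; the function name and example tables are hypothetical): raw Fisher p-values are first screened against pval_cutoff so that sequencing-error positions never enter the BH ranking, and the BH threshold is then computed over the surviving positions only.

```python
from scipy import stats

def call_snps_filtered_bh(tables, pval_cutoff=0.05, bh_fdr=0.01):
    """tables: list of (position, 2x2 count table) pairs."""
    # Step 1: pre-filter -- only positions passing the raw Fisher cutoff
    # are kept, so sequencing errors do not inflate the test count.
    survivors = []
    for pos, table in tables:
        _, p = stats.fisher_exact(table, alternative='greater')
        if p <= pval_cutoff:
            survivors.append((pos, p))
    # Step 2: Benjamini-Hochberg over the survivors only.
    survivors.sort(key=lambda x: x[1])
    n = len(survivors)
    k_max = 0
    for k, (_, p) in enumerate(survivors, start=1):
        if p <= (k / n) * bh_fdr:
            k_max = k
    return [pos for pos, _ in survivors[:k_max]]

tables = [
    ('strong',     [[20, 21], [0, 41]]),  # clear allelic SNP, high coverage
    ('borderline', [[5, 6], [0, 11]]),    # low-coverage SNP
    ('error',      [[1, 10], [0, 11]]),   # likely sequencing error
]
print(call_snps_filtered_bh(tables, bh_fdr=0.05))  # -> ['strong', 'borderline']
```

Because the 'error' position is dropped before ranking, it no longer shrinks the BH threshold, so the low-coverage SNP survives; whether this preserves the nominal FDR guarantee is a separate statistical question.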

I don't know if the modifications are statistically reasonable, how do you think of it? Looking forward to your opinion.

Best, Feng Xu

Magdoll commented 3 years ago

Hi @CrazyHsu , I like this change a lot! I'm testing it out now and it should be in the next version of Cupcake in a few days! -Liz

CrazyHsu commented 3 years ago

Thanks for adopting it; looking forward to your next version!

Feng Xu