Additional fields - Githubissues

Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF

Apache License 2.0

42 stars 30 forks source link

Additional fields #41

Closed jjzieve closed 5 years ago

jjzieve commented 5 years ago

This is a rebase from develop and some tweaks to @brianhill11's PR (#20). Tweaks include:

Addressing Ryan's comments from the original PR
Adding a regression test for it.
Adding support for indels in BAlleleFreqFormat
Updating the README

jjzieve commented 5 years ago

@KelleyRyanM ignore my previous comment, I see some of the indel logic in GenotypeFormat, I think I can apply that

KelleyRyanM commented 5 years ago

@jjzieve , Have you or @KyleLevi verified this pull request is still compatible with Python 3? -Ryan

jjzieve commented 5 years ago

@KelleyRyanM Nope, I can check python3 compatibility and address the other comments

jjzieve commented 5 years ago

@KelleyRyanM Ran regressions on python 3.5.6 and it worked. Seemed that was the highest version that pysam 0.9.0 had been built on.

jjzieve commented 5 years ago

@KelleyRyanM Kyle had failed regressions due to precision differences for GQ and LRR with an updated numpy module. I've confirmed with numpy=1.15.2 (I believe he has 1.16.3). i.e. regression ->0/0:0.437028:0.0:0.167821 updated numpy -> 0/0:0.43702784:0.0:0.16782084

Is this acceptable since the README points to one version (1.11)? If not, what's the best way to handle this? This SO post seems to outline the problem: https://stackoverflow.com/questions/4844865/how-to-set-the-precision-on-strnumpy-float64 I'm guessing numpy updated the default precision from 6 to 8 or something somewhere between 1.11 and 1.15

KelleyRyanM commented 5 years ago

@jjzieve , Can't we just format the string when returning from the BAlleleFreq formatter (e.g., return value -> return %.6f' % value)?

jjzieve commented 5 years ago

@KelleyRyanM yes, but its impacting gencall score also, it looks like it can't easily be fixed without re-generating the truth vcfs for the regressions. Also it appears round(value, 6) might be better than format since it doesn't pad 0.0 to 0.000000 (though maybe we want that?). But I can re-gen the vcfs and see how many values change. Most of the gencalls are 6 decimal places but I see some that are 5, 7, etc.

jjzieve commented 5 years ago

@KelleyRyanM and @KyleLevi it impacts 35 loci. Maybe I can move that effort to a separate PR/issue? There could be some usefulness in providing the option to specify the precision of the output? But I think that's a bit out of scope for this PR.

jjzieve commented 5 years ago

Spoke to Ryan, opening a seperate issue for the precision differences with numpy updates.