DecodeGenetics / BamHash

GNU General Public License v3.0
Check option #3

Open drchriscole opened 9 years ago

drchriscole commented 9 years ago

As with the standard md5sum command, it would be really handy to have a 'check' option (-c) to check a fastq/bam against its bamhash and programmatically verify its consistency.

Also, it's confusing that fastq read pairs are only counted once whereas they're counted twice in the bam. It would be clearer if the numbers agreed as well as the hashes. Thanks!

dpryan79 commented 9 years ago

I have a branch in my fork that implements essentially this in a program called bamhash_checksum_all. It accepts multiple BAM/fastq files and will print the checksum and count for each and then say whether they differ (it'll also set the exit status accordingly).
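As a rough illustration of the behaviour described above (this is not the code in the fork), a comparison step could look like the sketch below. The function name `compare_checksums`, the tab-separated output and the "match"/"differ" messages are all hypothetical. The sketch compares only the hashes, not the counts, on the assumption that counts can legitimately differ between fastq and BAM input, as noted earlier in this thread.

```python
def compare_checksums(records):
    """Print each file's checksum and count, then report whether the hashes agree.

    records: list of (filename, checksum, read_count) tuples, assumed to have
    been computed elsewhere. Returns 0 if all checksums are identical and 1
    otherwise, so the value can be used directly as a process exit status.
    """
    for filename, checksum, count in records:
        print(f"{filename}\t{checksum}\t{count}")

    # Only the hashes are compared; counts are printed for information.
    distinct = {checksum for _, checksum, _ in records}
    if len(distinct) <= 1:
        print("Checksums match")
        return 0
    print("Checksums differ")
    return 1
```

A caller would typically pass the return value to `sys.exit()` so that scripts can test the result.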

drchriscole commented 9 years ago

Thanks, that's neat. However, that wasn't (quite) the behaviour I was after.

With 'md5sum -c', you provide it with a file containing filenames and md5 hashes, and it goes through each file checking whether the hashes match. It reports OK or FAILED for each as it goes along, e.g.

    ~/tmp> cat md5sums
    0bb17fbf22eeb5ff8bc1e1e5401214ef rand1.csv
    af5d00fd5c1a7e66dfa770c229eb9bac rand2.csv
    74346ab1470c74e38d5c91db0b57ea23 rand3.csv
    ~/tmp> md5sum -c md5sums
    rand1.csv: OK
    rand2.csv: OK
    rand3.csv: FAILED
    md5sum: WARNING: 1 of 3 computed checksums did NOT match

I wonder if something similar could be achieved here, simply for the bam files? Once the BAMs are confirmed to be consistent with the fastqs, you don't need the fastqs anymore. For example:

    0bb17fbf22eeb5ff8bc1e1e5401214ef aligned.bam
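A minimal sketch of the 'md5sum -c' style check described above, to make the requested behaviour concrete. It is not BamHash code: `check_manifest`, `report` and `md5_of_file` are hypothetical names, and plain MD5 over the file bytes is used only as a stand-in for whatever hash the real tool computes. A real implementation would pass its own hashing function as `hash_func`.

```python
import hashlib


def md5_of_file(path):
    """Stand-in hash: MD5 of the raw file bytes (NOT the BamHash algorithm)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest_path, hash_func=md5_of_file):
    """Check every '<hash> <filename>' line; return a list of (filename, ok)."""
    results = []
    with open(manifest_path) as manifest:
        for line in manifest:
            parts = line.split(maxsplit=1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            expected, filename = parts[0].lower(), parts[1].strip()
            try:
                ok = hash_func(filename) == expected
            except OSError:
                ok = False  # a missing or unreadable file counts as a failure
            results.append((filename, ok))
    return results


def report(results):
    """Print OK/FAILED per file and return an exit status (0 = all matched)."""
    for filename, ok in results:
        print(f"{filename}: {'OK' if ok else 'FAILED'}")
    failed = sum(1 for _, ok in results if not ok)
    if failed:
        print(f"WARNING: {failed} of {len(results)} computed checksums did NOT match")
    return 1 if failed else 0
```

Calling `report(check_manifest("md5sums"))` prints one line per file, and the return value can be handed to `sys.exit()` so that a pipeline can stop when a BAM no longer matches its recorded hash.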

dpryan79 commented 9 years ago

Ah, I misunderstood. That would indeed be useful.

drchriscole commented 9 years ago

That is not to say that this branch is not a useful addition. It is. I'd like to see this branch merged with the trunk.