covingto closed this issue 8 years ago
Do you have some examples of samples that this happens on?
This is a CHOL directory:
ll data/CHOL/0775583e-c0a0-4f18-9ca2-8f89cedce3d6_01_11/
total 59712
-rw-rw-r-- 1 covingto can 196106 Mar 14 01:01 7234ada7a96d7b51e4adf5622f06255a.muse.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 54737791 Mar 14 01:01 7234ada7a96d7b51e4adf5622f06255a.mutect.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 134585 Mar 14 01:01 7234ada7a96d7b51e4adf5622f06255a.pindel.somatic.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 149918 Mar 14 01:01 7234ada7a96d7b51e4adf5622f06255a.SomaticSniper.annotated.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 331483 Mar 14 01:01 7234ada7a96d7b51e4adf5622f06255a.varscan.indel.annotated.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 2103318 Mar 14 01:01 7234ada7a96d7b51e4adf5622f06255a.varscan.snp.annotated.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 79550 Mar 14 22:25 filtered.radia.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 48915 Mar 14 22:25 TCGA_MC3.TCGA-4G-AAZO-01A-12D-A417-09.cleaned.muse.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 135143 Mar 14 22:25 TCGA_MC3.TCGA-4G-AAZO-01A-12D-A417-09.cleaned.pindel.somatic.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 96189 Mar 14 22:25 TCGA_MC3.TCGA-4G-AAZO-01A-12D-A417-09.cleaned.SomaticSniper.annotated.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 336028 Mar 14 22:25 TCGA_MC3.TCGA-4G-AAZO-01A-12D-A417-09.cleaned.varscan.indel.annotated.tcga_filtered.reheadered.vcf
-rw-rw-r-- 1 covingto can 1796255 Mar 14 22:25 TCGA_MC3.TCGA-4G-AAZO-01A-12D-A417-09.cleaned.varscan.snp.annotated.tcga_filtered.reheadered.vcf
There is only one mutect file and one radia file, but the other callers have two files each. I guess I could take the newer one, but what do you think?
The duplicated studies actually have different files. If you take a look at the freeze lists, the initial freeze list and the second freeze list both contain this patient and its samples. However, the files are different. It appears that the data was in the initial freeze, and was then co-cleaned and put into the second freeze list as well.
One problem here is that certain files will end up overwriting each other. For example, the radia files have the same name, so whichever file was downloaded last when I did this will overwrite the one downloaded first. I think the right way to do this is to amend the freeze lists: find all samples that were in both the first and second lists, remove them from the first list, and regenerate the pile of VCFs.
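A minimal sketch of that amendment step, in case it helps. The freeze list format is not shown in this thread, so this assumes each list has already been parsed into rows (dicts) with a hypothetical `sample_id` column; the function name and the sample IDs are made up for illustration.

```python
def split_first_freeze(first_rows, second_rows, key="sample_id"):
    """Split the first freeze list into rows to keep and rows to remove.

    A row is removed when its sample ID (hypothetical column name `key`)
    also appears in the second freeze list.
    """
    second_ids = {row[key] for row in second_rows}
    kept, removed = [], []
    for row in first_rows:
        if row[key] in second_ids:
            removed.append(row)  # duplicated in the second list
        else:
            kept.append(row)     # unique to the first list
    return kept, removed


# Placeholder IDs, not real samples.
first = [{"sample_id": "S1"}, {"sample_id": "S2"}]
second = [{"sample_id": "S2"}, {"sample_id": "S3"}]
kept, removed = split_first_freeze(first, second)
```

Returning the removed rows separately gives a list that can be sent around for review before the first freeze list is actually rewritten.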
OK, that sounds reasonable. Right now I'm skipping these for the MAF merger. Is this something that you can do, or do we need to get the rest of the group involved to get you the freeze lists?
Thanks Kyle C
I have the freeze lists. I will figure out which lines to remove and email the larger group to make sure that those files are okay to remove.
Dropped in favor of driving the merger with specific manifest files, so there is no real need for a glob pattern.
The glob patterns in run-batch.py don't cover all cases (line 46, in the dispatch function). Some studies seem to include two types of files for each caller. The glob pattern is likely too simple and may need to be replaced with more logic to isolate the correct or best file to use. The dispatch function must generate two lists, one of VCF files and one of caller tags, to be submitted to merge.py.
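For illustration, here is a rough sketch of what such selection logic could look like. It is not the code in run-batch.py: the tag names, the substring patterns, and the rule of preferring files whose name contains `.cleaned.` are all assumptions based only on the file names in the CHOL listing in this thread.

```python
import os

# Hypothetical caller tags, matched as substrings of the file name.
CALLER_PATTERNS = [
    ("varscan_indel", ".varscan.indel."),
    ("varscan_snp", ".varscan.snp."),
    ("somaticsniper", ".SomaticSniper."),
    ("mutect", ".mutect."),
    ("pindel", ".pindel."),
    ("muse", ".muse."),
    ("radia", ".radia."),
]


def caller_tag(path):
    """Return the caller tag for a VCF path, or None if nothing matches."""
    name = os.path.basename(path)
    for tag, pattern in CALLER_PATTERNS:
        if pattern in name:
            return tag
    return None


def is_cleaned(path):
    """Assumed preference rule: co-cleaned files carry '.cleaned.' in the name."""
    return ".cleaned." in os.path.basename(path)


def dispatch_lists(paths, prefer=is_cleaned):
    """Pick one VCF per caller and return (vcf_files, caller_tags) in parallel."""
    chosen = {}
    for path in sorted(paths):
        tag = caller_tag(path)
        if tag is None:
            continue  # not a recognized caller output
        current = chosen.get(tag)
        if current is None or (prefer(path) and not prefer(current)):
            chosen[tag] = path
    tags = sorted(chosen)
    return [chosen[tag] for tag in tags], tags
```

Passing `prefer` as a parameter keeps the "which file is best" decision in one place, so it could be swapped for a different rule (for example, newest modification time) without touching the grouping logic.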