Error at the merging step

aditisk commented 4 years ago

Dear @aradenbaugh,

I'm a new user and I was trying to follow the steps in the user guide. I am getting an error while using the "mergeChroms.py" script. I have multiple filtered VCF files per sample but when I try to merge them, the merged VCF file is empty. I tried with and without the --gzip argument, both lead to the same issue. I looked at the individual VCFs are they look okay and seem to have the expected fields.

Is there something I am missing or doing incorrectly ? Thanks.

aradenbaugh commented 4 years ago

It sounds like the mergeChroms.py script is not able to find the individual files. Can you send me the command that you are using? Can you make sure that the path that you provide ends with a forward slash? For example: instead of /path/to/filtered/files use /path/to/filtered/files/

aditisk commented 4 years ago

Here is the command I am using:

python mergeChroms.py HN01 /zfs1/rferris/adk85/Zandberg_0001-WES/filtered/ /zfs1/rferris/adk85/Zandberg_0001-WES/filtered/

I tried adding the log argument and set it to INFO and here is the output of that if it helps:

05/10/2020 11:22:53 PM INFO Total time for Id HN01: Total time=3.2643477122e-07 hrs, 1.95860862732e-05 mins, 0.00117516517639 secs

aditisk commented 4 years ago

I think I figured out what might be the issue. According to the instructions in the user guide, the mergeChroms.py is supposed to merge all file with this naming pattern: patientId_chr*.vcf

However, in the previous filtering step, all my files have the format patientdID_{dbSNP}_chr*.vcf which may be why the mergeChroms.py is having trouble locating the files ?

aradenbaugh commented 4 years ago

Yes, mergeChroms.py just looks for files with the pattern patiendId_chr*.vcf.

Are you saying that the only files that are output from the filtering step are files with the format patientID_{dbSNP}_chr*.vcf? The filterRadia.py script is a wrapper script that does many filtering steps in succession. You can deactivate many of the filtering steps if you want, but the last required filtering step outputs a file in the format that is expected by mergeChroms.py (patientId_chr*.vcf). Can you send me the filterRadia.py command that you used? Are you only applying the dbSNP filter (i.e. you disabled the other filters)? Was there possibly an error from the filterRadia.py step? If it doesn't appear like an error occurred in the filterRadia.py step, but you are not seeing a file named patientId_chr*.vcf, then try running filterRadia.py with the DEBUG flag set. That should give us an idea where the command failed.

aditisk commented 4 years ago

I'm sorry, I just used the dbSNP filter as an example. All the files are named in a similar way patientID_{filter}_chr*.vcf. I used all filters except the Darned and Radar (since I couldn't find these in the hg38 data folder). So I have filtered files for the remaining filters but the file format is not compatible with what the mergeChroms.py supports.

Is the filtering step supposed to name the files as patientID_{filter}_chr.vcf ? My output files were in this is format and it doesn't seem to be compatible with the merging step. Renaming the files to fit the patientID_chr.vcf format is one option. Is there a way to enforce this pattern as an output at the filtering step so I don't have to rename the files ?

aradenbaugh commented 4 years ago

Yes, the last required filtering step outputs a file in the format that is expected by mergeChroms.py, namely the patientId_chr*.vcf format. You shouldn't have to rename the files. If this file doesn't exist in the output directory, then it seems like there is an error occurring in the last filtering step by filterRadia.py. Did you by chance specify a different output file name in the filterRadia.py command using the -o option? Can you run the filtering step on one of the smaller chromosomes and activate the DEBUG flag and send me the command you used and the output? This will help narrow down the issue.

aditisk commented 4 years ago

Thanks for the suggestion about the DEBUG flag. I realized that the COSMIC database isn't already provided for the hg38 due to the license agreement. So before sending you the output, I'm going to download the data and make the BED files for each chromosome.

There are multiple COSMIC files available for download, which one do you recommend using ? The "CosmicGenomeScreensMutantExport.tsv" seems to be the one compatible with exome sequencing but I'm not sure if that is the correct one. Thanks for your patience and help with this.

aradenbaugh commented 4 years ago

It seems like you can use CosmicCompleteTargetedScreensMutantExport.tsv.gz for targeted screens and CosmicGenomeScreensMutantExport.tsv.gz for whole genome screens. Even with exome targeted sequencing, we do see reads aligned outside of the exome. If you are interested in any of these regions, go ahead and get the whole genome screens. If you are only interested in the transcriptome, then the targeted screens should be fine.

aditisk commented 4 years ago

I was finally able to get the expected output. As you pointed out, the error was actually at the filtering step rather than merging. Thank you so much for your help in figuring this out.

aradenbaugh / radia

Error at the merging step #10