Closed xiamaz closed 6 years ago
Do I have to clean all the files and rerun?
Downloading files from AWS Bucket fdna-pedia-dump
Checking AWS: [####################] 3886/3886
S3 Stats: Downloaded 0 new 3812 updated of 3886 total files.
Process jsons: [####################] 2605/2605
Unfiltered 2605
Traceback (most recent call last):
File "preprocess.py", line 436, in
I ran it from a clean folder, and there is an error at the end.
Generate VCFs: [####################] 712/712
Traceback (most recent call last):
File "preprocess.py", line 436, in
I got this error. Do I have to remove all the json files and mutations?
== Quality check ==
Get pathogenic genes in geneList: [####################] 675/675
Saving qc log
Saving passing cases to new location
Traceback (most recent call last):
File "preprocess.py", line 448, in <module>
main()
File "preprocess.py", line 439, in main
args, config_data, qc_cases, old_jsons
File "preprocess.py", line 398, in quality_check_cases
save_old_to_qc()
File "/data/project/pedia5/PEDIA-workflow/lib/visual.py", line 27, in progress_wrapper
size = len(args[0])
IndexError: tuple index out of range
No, you do not have to. I am testing the changes again. It is just very time-consuming to run the entire pipeline.
Are the two pull requests independent? I am only testing this one.
Yes, I noted that in the other pull request.
Not merging with old data. Incompatible formats.
I saw this while running on a folder with existing output. Which formats are different?
The old json check log is a list, while the new format is a dict. Logs in the old format will not be merged; they will be overwritten instead.
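As a hypothetical illustration of the two shapes (the exact keys are assumed, not taken from the code):

```python
# Old format: a flat list of case ids, no place to attach per-case data.
old_log = ["21147", "30012"]

# New format: a dict keyed by case id, so entries can be looked up
# and updated individually instead of overwriting the whole log.
new_log = {
    "21147": {"passed": True},
    "30012": {"passed": False},
}

print("21147" in new_log)          # True
print(new_log["21147"]["passed"])  # True
```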
Get pathogenic genes in geneList: [####################] 743/743
Saving qc log
Saving passing cases to new location
Traceback (most recent call last):
File "preprocess.py", line 451, in <module>
main()
File "preprocess.py", line 442, in main
args, config_data, qc_cases, old_jsons
File "preprocess.py", line 401, in quality_check_cases
save_old_to_qc(qc_passed)
File "/data/project/pedia5/PEDIA-workflow/lib/visual.py", line 28, in progress_wrapper
for i, res in enumerate(iter_func(*args, **kwds)):
File "preprocess.py", line 397, in save_old_to_qc
destination=config_data.quality["qc_output_path"]
TypeError: save_json() got an unexpected keyword argument 'destination'
I ran it on the folder with previous results. I have a question about quality_check.log. I removed one case, 21147, from aws_dir/cases/ and also removed it from config.yml.
There is no passed key in quality_check.log. Can I test it this way?
I am also running it in a clean folder, but it has not finished yet.
What do you mean by testing?
changed_only works by comparing file modification times against the timestamp of the AWS download directory (which was created in the last run) and updates all files with modification times newer than that. So deleting files from the results should not work.
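A minimal sketch of the timestamp comparison described above, assuming a flat download directory (the function name and layout are illustrative, not taken from the repository):

```python
import os


def changed_files(download_dir: str, last_run_timestamp: float) -> list:
    """Return files modified after the previous run.

    Files are selected by comparing their modification time against the
    timestamp of the last AWS download; deleting files elsewhere (e.g.
    in the results) therefore does not mark any case as changed.
    """
    changed = []
    for name in os.listdir(download_dir):
        path = os.path.join(download_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) > last_run_timestamp:
            changed.append(path)
    return changed
```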
So the proper way to test it is to run it again tomorrow, is that correct? I just want to test it today to check whether there is another issue, so I deleted 21147 from the aws_dir/cases/ folder and also removed its entry from config.yml.
The following is the log of a run on the whole folder. Case 21147 was downloaded via AWS and passed QC, but I can't find the "passed" key in quality_check.log.
== Process new json files ==
Downloading files from AWS Bucket fdna-pedia-dump
Checking AWS: [####################] 3952/3952
S3 Stats: Downloaded 1 new 0 updated of 3952 total files.
Process jsons: [####################] 1/1
Unfiltered 1
Filtered rough criteria 1
== Create cases from new json format ==
Create cases: [####################] 1/1
Mutalyzer: [# ] 0/0
== Phenomization of cases ==
Phenomization: [####################] 1/1
== Mapping to old json format ==
Convert old: [####################] 1/1
Generate VCFs: [####################] 1/1
== Quality check ==
Get pathogenic genes in geneList: [####################] 1/1
Saving qc log
Saving passing cases to new location
Save passing qc: [####################] 1/1
== QC results ==
Passed: 1 Failed: 0
Oh, is that because you changed the format of "passed" in quality_check.log?
Yes, the quality_check.log format has been changed to contain only dicts. This makes case_id-based merging much more straightforward. Unfortunately this means that previous scripts will need some minor changes.
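A minimal sketch of what case_id-based merging could look like with the dict format (the function name and entry shape are assumptions, not taken from the code):

```python
def merge_qc_logs(existing: dict, new: dict) -> dict:
    """Merge a new QC log into an existing one by case id.

    With dict-based logs, entries from the new run replace the matching
    case ids while untouched cases are preserved; the old list format
    offered no key to merge on.
    """
    merged = dict(existing)
    merged.update(new)
    return merged


log = merge_qc_logs({"21147": {"passed": True}},
                    {"30012": {"passed": False}})
print(sorted(log))  # ['21147', '30012']
```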
It's fine. I will test it again tomorrow with new data. We just need everyone to rerun it from a clean folder when we merge it.
With the newer changes, running on folders with previous data should simply overwrite files. So rerunning from a clean folder is only necessary if one cares about the correctness of the produced logs.
I think F2G is always updating all cases. Will this change in the future? Otherwise these changes are quite useless, and my caching implementation would have more impact.
I remember they only updated the cases when they had a new build, but it seems they update them every day now. In fact, I don't know what will change after we publish the paper. We have had some discussion about the data transfer issue; they talked about switching to another mechanism, the LAB API. That is only for the real exome cases, and I don't know whether we will continue the annotation by curators after that.
Anyway, the thing I want to prevent is the regeneration of the VCFs. Snakemake detects whether the data has been modified or overwritten since the last execution. If the preprocessing overwrites the VCFs, all the simulations have to start over, and it takes more than one day to process 700 cases because it involves spiking the mutations into 1KG samples. It's fine if we only change the JSON files; rerunning those takes less than an hour.
By the way, can we just check whether the VCF file converted from the HGVS code is already in the mutations folder, and only convert it if it is not? I think it is unusual for a passed HGVS code to change.
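The proposed check could look like this sketch (the file naming scheme is an assumption for illustration, not the actual layout):

```python
import os


def vcf_needed(case_id: str, mutations_dir: str) -> bool:
    """Return True if the VCF for a case still has to be generated.

    If the VCF converted from the HGVS code already exists in the
    mutations folder, the conversion (and the expensive downstream
    simulation rerun) can be skipped, since a passed HGVS code rarely
    changes.
    """
    vcf_path = os.path.join(mutations_dir, "{}.vcf.gz".format(case_id))
    return not os.path.exists(vcf_path)
```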
I am working on this in the caching branch. I feel that the changes introduced here are not really necessary if we have caching in place.
I will cherry-pick some changes into the caching branch and close this one, if that is OK with you.
Thanks.
A new config option for only processing new and updated files in AWS has been introduced. This required several changes to logging and output processing.
QC logs (json and case level) will be merged with existing data if the data is being updated.
config.yml will be merged with the new data (although no files will be removed). This might not be necessary, as currently config.yml is generated entirely from the vcf directories, and I am not sure it is needed at all. Why couldn't we just get the files from the vcf directories in the simulation?
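The alternative raised above could be as simple as listing the vcf directory at simulation time, sketched here (the function name and file suffix are assumptions):

```python
import os


def case_ids_from_vcf_dir(vcf_dir: str) -> list:
    """Collect case ids directly from the VCF directory.

    Instead of maintaining a merged case list in config.yml, the
    simulation could derive the cases from the files actually present.
    """
    ids = []
    for name in sorted(os.listdir(vcf_dir)):
        if name.endswith(".vcf.gz"):
            ids.append(name[:-len(".vcf.gz")])
    return ids
```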
Fixes #58