Closed xiamaz closed 6 years ago
Do I have to clean all the files and rerun?
Downloading files from AWS Bucket fdna-pedia-dump
Checking AWS: [####################] 3886/3886
S3 Stats: Downloaded 0 new 3812 updated of 3886 total files.
Process jsons: [####################] 2605/2605
Unfiltered 2605
Traceback (most recent call last):
File "preprocess.py", line 436, in
I ran it from a clean folder, and there is an error at the end.
Generate VCFs: [####################] 712/712
Traceback (most recent call last):
File "preprocess.py", line 436, in
I got this error. Do I have to remove all the json files and mutations?
== Quality check ==
Get pathogenic genes in geneList: [####################] 675/675
Saving qc log
Saving passing cases to new location
Traceback (most recent call last):
File "preprocess.py", line 448, in <module>
main()
File "preprocess.py", line 439, in main
args, config_data, qc_cases, old_jsons
File "preprocess.py", line 398, in quality_check_cases
save_old_to_qc()
File "/data/project/pedia5/PEDIA-workflow/lib/visual.py", line 27, in progress_wrapper
size = len(args[0])
IndexError: tuple index out of range
No, you do not have to. I am testing the changes again. It is just very time-consuming to run the entire pipeline.
Are the two pull requests independent? I am only testing this one.
Yes, I noted that in the other pull request.
Not merging with old data. Incompatible formats.
I saw this while running on a folder with existing output. Which formats are different?
The old json check log is a list, while the new format is a dict. Logs in the old format will not be merged; they will be overwritten instead.
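As a hypothetical illustration of the two shapes (the exact keys are assumed, not taken from the code):

```python
# Old format: a flat list of case ids, no place to attach per-case data.
old_log = ["21147", "30012"]

# New format: a dict keyed by case id, so entries can be looked up
# and updated individually instead of overwriting the whole log.
new_log = {
    "21147": {"passed": True},
    "30012": {"passed": False},
}

print("21147" in new_log)          # True
print(new_log["21147"]["passed"])  # True
```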
Get pathogenic genes in geneList: [####################] 743/743
Saving qc log
Saving passing cases to new location
Traceback (most recent call last):
File "preprocess.py", line 451, in <module>
main()
File "preprocess.py", line 442, in main
args, config_data, qc_cases, old_jsons
File "preprocess.py", line 401, in quality_check_cases
save_old_to_qc(qc_passed)
File "/data/project/pedia5/PEDIA-workflow/lib/visual.py", line 28, in progress_wrapper
for i, res in enumerate(iter_func(*args, **kwds)):
File "preprocess.py", line 397, in save_old_to_qc
destination=config_data.quality["qc_output_path"]
TypeError: save_json() got an unexpected keyword argument 'destination'
I ran it on the folder with previous results. I have a question about quality_check.log. I removed one case, 21147, from aws_dir/cases/ and also removed it from config.yml.
There is no passed key in quality_check.log. Can I test it this way?
I am also running it in a clean folder, but it has not finished yet.
What do you mean by testing?
changed_only works by comparing file modification times against the timestamp of the AWS download directory (which was created in the last run) and updates all files with modification times newer than that. So deleting files from the results should not work.
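A minimal sketch of the timestamp comparison described above, assuming a flat download directory (the function name and layout are illustrative, not taken from the repository):

```python
import os


def changed_files(download_dir: str, last_run_timestamp: float) -> list:
    """Return files modified after the previous run.

    Files are selected by comparing their modification time against the
    timestamp of the last AWS download; deleting files elsewhere (e.g.
    in the results) therefore does not mark any case as changed.
    """
    changed = []
    for name in os.listdir(download_dir):
        path = os.path.join(download_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) > last_run_timestamp:
            changed.append(path)
    return changed
```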
So the proper way to test it is to run it again tomorrow, is that correct? I just want to test it today to check whether there is another issue, so I deleted 21147 from the aws_dir/cases/ folder and also removed its entry from config.yml.
The following is the log of a run on the whole folder. Case 21147 was downloaded via AWS and passed QC, but I can't find the "passed" key in quality_check.log.
== Process new json files ==
Downloading files from AWS Bucket fdna-pedia-dump
Checking AWS: [####################] 3952/3952
S3 Stats: Downloaded 1 new 0 updated of 3952 total files.
Process jsons: [####################] 1/1
Unfiltered 1
Filtered rough criteria 1
== Create cases from new json format ==
Create cases: [####################] 1/1
Mutalyzer: [# ] 0/0
== Phenomization of cases ==
Phenomization: [####################] 1/1
== Mapping to old json format ==
Convert old: [####################] 1/1
Generate VCFs: [####################] 1/1
== Quality check ==
Get pathogenic genes in geneList: [####################] 1/1
Saving qc log
Saving passing cases to new location
Save passing qc: [####################] 1/1
== QC results ==
Passed: 1 Failed: 0
Oh, is that because you changed the format of "passed" in quality_check.log?
Yes, the quality_check.log format has been changed to contain only dicts. This makes case_id-based merging much more straightforward. Unfortunately this means that previous scripts will need some minor changes.
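A minimal sketch of what case_id-based merging could look like with the dict format (the function name and entry shape are assumptions, not taken from the code):

```python
def merge_qc_logs(existing: dict, new: dict) -> dict:
    """Merge a new QC log into an existing one by case id.

    With dict-based logs, entries from the new run replace the matching
    case ids while untouched cases are preserved; the old list format
    offered no key to merge on.
    """
    merged = dict(existing)
    merged.update(new)
    return merged


log = merge_qc_logs({"21147": {"passed": True}},
                    {"30012": {"passed": False}})
print(sorted(log))  # ['21147', '30012']
```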
It's fine. I will test it again tomorrow with new data. We just need everyone to rerun it from a clean folder when we merge it.
With the newer changes, running on folders with previous data should simply overwrite files. So rerunning from a clean folder is only necessary if one cares about the correctness of the produced logs.
I think F2G is always updating all cases. Will this change in the future? Otherwise these changes are quite useless, and my caching implementation would have more impact.
I remember they only updated the cases when they had a new build, but it seems they update them every day now. In fact, I don't know what will change after we publish the paper. We have had some discussion about the data transfer issue; they talked about switching to another mechanism, the LAB API. That is only for the real exome cases, and I don't know whether we will continue the annotation by curators after that.
Anyway, the thing I want to prevent is the regeneration of the VCFs. Snakemake detects whether the data has been modified or overwritten since the last execution. If the preprocessing overwrites the VCFs, all the simulations have to start over, and it takes more than one day to process 700 cases because it involves spiking the mutations into 1KG samples. It's fine if we only change the JSON files; rerunning those takes less than an hour.
By the way, can we just check whether the VCF file converted from the HGVS code is already in the mutations folder, and only convert it if it is not? I think it is unusual for a passed HGVS code to change.
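The proposed check could look like this sketch (the file naming scheme is an assumption for illustration, not the actual layout):

```python
import os


def vcf_needed(case_id: str, mutations_dir: str) -> bool:
    """Return True if the VCF for a case still has to be generated.

    If the VCF converted from the HGVS code already exists in the
    mutations folder, the conversion (and the expensive downstream
    simulation rerun) can be skipped, since a passed HGVS code rarely
    changes.
    """
    vcf_path = os.path.join(mutations_dir, "{}.vcf.gz".format(case_id))
    return not os.path.exists(vcf_path)
```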
I am working on this in the caching branch. I feel that the changes introduced here are not really necessary if we have caching in place.
I will cherry-pick some changes into the caching branch and close this one, if that is OK with you.
Thanks.
A new config option for only processing new and updated files in AWS has been introduced. This required several changes to logging and output processing.
QC logs (json and case level) will be merged with existing data if the data is being updated.
config.yml will be merged with the new data (although no files will be removed). This might not be necessary, as currently config.yml is generated entirely from the vcf directories, and I am not sure it is needed at all. Why couldn't we just get the files from the vcf directories in the simulation?
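The alternative raised above could be as simple as listing the vcf directory at simulation time, sketched here (the function name and file suffix are assumptions):

```python
import os


def case_ids_from_vcf_dir(vcf_dir: str) -> list:
    """Collect case ids directly from the VCF directory.

    Instead of maintaining a merged case list in config.yml, the
    simulation could derive the cases from the files actually present.
    """
    ids = []
    for name in sorted(os.listdir(vcf_dir)):
        if name.endswith(".vcf.gz"):
            ids.append(name[:-len(".vcf.gz")])
    return ids
```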
Fixes #58