CDCgov / seqsender

Automated Pipeline to Generate FTP Files and Manage Submission of Sequence Data to Public Repositories
https://cdcgov.github.io/seqsender/
Apache License 2.0

GISAID covCLI bugfixes & BioSample/SRA modifications #64

Closed erikwolfsohn closed 2 months ago

erikwolfsohn commented 3 months ago

Edit: I added some BioSample/SRA workflow changes and attached my config.yaml and metadata files in case you want to test against them: example_config.yaml.txt, example_metadata.csv

Hey Dakota! Thanks for getting this new release out. The new handling for BioSample packages is amazing. I hugely appreciate how much this project simplifies metadata validation across multiple pathogens/packages/repositories and how convenient it makes large volume submissions.

I've tested the BioSample/SRA and GISAID covCLI workflows. The NCBI workflows worked perfectly, but I ran into a bunch of submission failures while testing covCLI.

I made some modifications, and now my GISAID submissions are going through reliably. I haven't had time to do any serious testing, so I can't say if all these changes will hold up, but I wanted to go ahead and submit a pull request in case anything in it is helpful.

The former description of this pull request is a little out of date now. I've been testing GISAID covCLI, SRA, and BioSample heavily, using the SARS-CoV-2 and OneHealth Enteric BioSample packages. Below are changes addressing workflow errors/submission failures I encountered during testing. I want to revisit a few of the changes I made, but hopefully some of them are useful!

⚙️ General

🛠️ covCLI updates/bugfixes

📋 BioSample & SRA updates/bugfixes

Testing data was generated via:

python seqsender.py test_data --biosample --sra --gisaid --organism COV --submission_dir test_data/CCPHL/

Metadata and config templates were created with the Shiny app Submission Wizard

And the workflow was run with this command:

python seqsender.py submit --biosample --sra --gisaid --organism COV --submission_dir test_data/CCPHL/ --submission_name COV_TEST_DATA --config_file test_data/CCPHL/cov_ccphl_config.yaml --metadata_file test_data/CCPHL/meta2.csv --fasta_file test_data/CCPHL/sequence.fasta --test

dthoward96 commented 2 months ago

Hey @erikwolfsohn,

Thanks for all the changes you've contributed! I've incorporated them into the updates I was already working on for v1.2.1 and have also expanded on some of the additions you made.

I expanded the try/except for file permission errors to cover all files generated by the file_handler.py script. I really liked your changes for GISAID, but I made a couple of modifications to eliminate the database prefix issues and changed how the logging occurs, as I noticed that the incorrect sample name could end up being used. The issue with the bs-description field should be resolved, as I also changed how the config file handles the description title and comment. I've split the "bs-description" field into "bs-sample_title" and "bs-sample_description" and updated the documentation in the templates, which should better explain those fields. I also provided some additional info in the Shiny issue you created.
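
For anyone following along, the file permission handling mentioned above can be sketched roughly as below. This is only an illustration of the pattern, not the actual file_handler.py code: `save_text` and `FileWriteError` are hypothetical names made up for this example.

```python
import os
import tempfile


class FileWriteError(Exception):
    """Hypothetical error type used in this sketch; not part of seqsender."""


def save_text(path: str, content: str) -> None:
    """Write content to path, turning a permission failure into a clearer error.

    Hypothetical helper: it only illustrates wrapping a file write in
    try/except so the user sees an actionable message.
    """
    try:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(content)
    except PermissionError as err:
        # A common cause is the file being open in another program,
        # or the submission directory being read-only.
        raise FileWriteError(
            f"Cannot write '{path}': permission denied. "
            "Close the file if it is open elsewhere and check folder permissions."
        ) from err


# Usage: write a small file into a temporary directory.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "example_log.csv")
save_text(out_path, "sample_name,status\n")
```

Routing every generated file through one helper like this keeps the error message consistent, which is the main reason to centralize the handling instead of repeating try/except at each write.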

Thanks again, -Dakota