FDA-ARGOS / data.argosdb

Pond - Wuhan reassembly BCO + FASTA #114

Closed stephenshank closed 1 year ago

stephenshank commented 1 year ago

As discussed several times I've reassembled the FASTA from NC_045512 based on a workflow from a PLOS Pathogens paper by our lab and the Galaxy team.

The main issue with providing a BCO right now is that it's actually a workflow that uses two subworkflows, and as such produces three BCOs. Edit: we decided to abandon the PLOS Pathogens BCO for the time being.

I've had some discussion with @HadleyKing about merging BCOs in such a scenario. I am also trying my modified pipeline to see how well it does. Last resort will be redoing into one big workflow, which seems to be of the least utility to me.

Feedback/discussion welcome.

~~SRA accessions for completeness - Illumina: SRR10903401 SRR10903402 SRR10971381 ONT: SRR10948550 SRR10948474 SRR10902284~~ Edit: we also decided to abandon the PLOS Pathogens SRA accessions for the time being. See below for the desired approach.

rajamazumder commented 1 year ago

Stephanie or Hadley, I need a bit more information than what is in this thread. Were all of the above runs pooled before the assembly was done? That is a bit of an unusual approach. Is there a draft BCO available? I just need the description domain. Or maybe, Stephen, you can just send me the steps you took in simple text or bullet points?

-- Raja Mazumder, Ph.D. Professor Department of Biochemistry and Molecular Medicine School of Medicine & Health Sciences The George Washington University Ross Hall, Room 540 2300 Eye Street N.W. Washington, DC 20037 Phone office: 202-994-5004 Phone lab: 202-994-3639 Phone dept: 202-994-5311 Fax: 202-994-8974

On Fri, Oct 28, 2022 at 4:31 PM Stephen Shank @.***> wrote:

As discussed several times I've reassembled the FASTA https://data.hyphy.org/web/argos/wuhan-reassembly/wuhan-reassembled.fasta from NC_045512 https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2/ based on a workflow from a PLOS Pathogens paper by our lab and the Galaxy team https://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1008643 .

The main issue with providing a BCO right now is that it's actually a workflow that uses two subworkflows, and as such produces three BCOs https://data.hyphy.org/web/argos/wuhan-reassembly/.

I've had some discussion with @HadleyKing https://github.com/HadleyKing about merging BCOs in such a scenario. I am also trying my modified pipeline to see how well it does. Last resort will be redoing into one big workflow, which seems to be of the least utility to me.

Feedback/discussion welcome.

SRA accessions for completeness - Illumina: SRR10903401 SRR10903402 SRR10971381 ONT: SRR10948550 SRR10948474 SRR10902284

stephenshank commented 1 year ago

@rajamazumder As written/linked above there are three BCOs and if we decide to keep this approach, we'll either need to merge them or get a single one. I agree it's unconventional... but I did not author the study, I am just trying to reproduce it.

I've also run an assembly pipeline with Megahit (the tool that was used in the original assembly) on the associated SRA accession, located here: https://galaxy.hyphy.org/u/stephenshank/h/wuhan-reassembly---megahit

I'm currently assessing this as an alternative and will provide an update in our weekly meeting today. I consider this a more conventional approach that would yield a single BCO, but I need to determine whether or not it reproduces something close to the original assembly. Happy to answer any further questions if anything is unclear.

stephenshank commented 1 year ago

Megahit BCO is here but again for emphasis I have not yet assessed the fidelity to the original assembly.

If I understand correctly from #113 the description domain should be populated based on the tools due to the work of @HadleyKing .

rajamazumder commented 1 year ago

Stephanie and Hadley can you please add this to the list of things we need to discuss hopefully today? I just need a bit of clarification. Thanks.

On Mon, Oct 31, 2022, 9:21 AM Stephen Shank @.***> wrote:

Megahit BCO is here https://data.hyphy.org/web/argos/wuhan-reassembly/Megahit.json but again for emphasis I have not yet assessed the fidelity to the original assembly.

If I understand correctly from #113 https://github.com/FDA-ARGOS/data.argosdb/issues/113 the description domain should be populated based on the tools due to the work of @HadleyKing https://github.com/HadleyKing .

JingyueWu commented 1 year ago

Hi all,

Based on this thread of discussion and our meeting with Stephen this afternoon, I'm going to post a list of actionable items in chronological order for me to work on. As Stephen said earlier, at this point we are dealing with his Megahit BCO:

I am aiming to complete the above by the end of this week, please let me know if you have any questions, thanks.

stephenshank commented 1 year ago

@JingyueWu Here is the reassembled FASTA. As I am seeing quite often, there is some noise at both ends, but in the interior of the alignment (done manually) it appears to also match the original assembly, which was quite pleasing. I've uploaded a pairwise alignment between the two for anyone who would like to double check, and have started developing tools to assess such comparisons.

Usability domain description: "Reads were downloaded with fasterq-dump, controlled for quality with fastp, and then mapped to the human genome with BWA-MEM. Any reads flagged as mapped are subsequently filtered out, thereby removing reads associated with the host. Remaining reads are then converted back into FASTQ format and assembled using Megahit. The longest contig is automatically extracted, and was manually aligned to the original assembly to assess orientation."

Keywords: assembly, quality control, megahit

These also pertain to the BCO in #113. Thanks for pointing out what was missing in the BCO, I've refreshed my memory on how the Galaxy UI populates these fields and they should be automatically included in subsequent submissions.
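The "longest contig is automatically extracted" step in the usability description above can be illustrated with a minimal sketch. This is not the Galaxy tool used in the workflow; the function names `read_fasta` and `longest_contig` and the contig names in the example are hypothetical, and the only assumption is that the assembler output is a plain multi-record FASTA.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_contig(handle: TextIO) -> Tuple[str, str]:
    """Return the (header, sequence) record with the longest sequence.

    Raises ValueError if the input contains no records.
    """
    return max(read_fasta(handle), key=lambda rec: len(rec[1]))


# Toy input with made-up contig names, standing in for assembler output.
contigs = io.StringIO(">c1\nACGT\n>c2\nACGTAC\nGT\n>c3\nAC\n")
header, sequence = longest_contig(contigs)
```

Note that this only selects the contig; checking orientation against the original assembly is a separate step, which the description says was done manually.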

stephenshank commented 1 year ago

My ORCID: https://orcid.org/0000-0003-0734-9953

rajamazumder commented 1 year ago

Stephen, I checked the alignment you sent and it looks like several nucleotides are missing from both ends. We got similar results using Megahit (see image below of both ends of the alignment). We got much better results (which is a relative term) using MetaviralSPAdes (see image below). Do you have access to MetaviralSPAdes and want to try?

[image: a577356c-6feb-4e46-8f59-5abe86f876be.png]
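To put a number on "several nucleotides are missing from both ends", one option is to count the leading and trailing gap columns of the reassembly in the pairwise alignment, and separately count mismatches in the interior. The following is a minimal sketch under stated assumptions: both sequences come from the same alignment (equal length, `-` as the gap character), and the function names and toy sequences are hypothetical rather than taken from any tool mentioned in this thread.

```python
from typing import Tuple


def missing_at_ends(aligned: str, gap: str = "-") -> Tuple[int, int]:
    """Count gap characters at the start and at the end of an aligned sequence."""
    without_left = aligned.lstrip(gap)
    left = len(aligned) - len(without_left)
    if not without_left:
        # The sequence is all gaps; report everything as missing on the left.
        return left, 0
    right = len(without_left) - len(without_left.rstrip(gap))
    return left, right


def interior_mismatches(reference: str, query: str, gap: str = "-") -> int:
    """Count differing columns between the query's first and last aligned base."""
    if len(reference) != len(query):
        raise ValueError("aligned sequences must have the same length")
    left, right = missing_at_ends(query, gap)
    end = len(query) - right
    return sum(1 for r, q in zip(reference[left:end], query[left:end]) if r != q)


# Toy aligned pair (not real data): two bases missing at each end, one mismatch.
ref = "ACGTACGTAC"
qry = "--GTACCT--"
ends = missing_at_ends(qry)            # (2, 2)
diffs = interior_mismatches(ref, qry)  # 1
```

Reporting the two end counts and the interior mismatch count separately would make it easier to compare assemblers on the same terms.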

On Tue, Nov 1, 2022 at 2:55 PM Stephen Shank @.***> wrote:

@JingyueWu https://github.com/JingyueWu Here is the reassembled FASTA https://data.hyphy.org/web/argos/wuhan-reassembly/reassembly.fasta. As I am seeing quite often, there is some noise at both ends, but in the interior of the alignment (done manually) it appears to also match the original assembly, which was quite pleasing. I've uploaded a pairwise alignment https://data.hyphy.org/web/argos/wuhan-reassembly/plucked_and_aligned.fasta between the two for anyone who would like to double check, and have started developing tools to assess such comparisons.

Usability domain description: "Reads were downloaded with fasterq-dump, controlled for quality with fastp, and then mapped to the human genome with BWA-MEM. Any reads flagged as mapped are subsequently filtered out, thereby removing reads associated with the host. Remaining reads are then converted back into FASTQ format and assembled using Megahit. The longest contig is automatically extracted, and was manually aligned to the original assembly to assess orientation."

Keywords: assembly, quality control, megahit

These also pertain to the BCO in #113 https://github.com/FDA-ARGOS/data.argosdb/issues/113. Thanks for pointing out what was missing in the BCO, I've refreshed my memory on how the Galaxy UI populates these fields and they should be automatically included in subsequent submissions.

stephenshank commented 1 year ago

@rajamazumder Indeed, I mentioned the noise at the ends. There is actually some noise in the interior of #113 as well so I'd like to do a comparison of approaches this month, along with some visualization.

I've installed metaviralspades on our Galaxy instance. I can give it a try and report for tomorrow. At this point my greatest concern is: does it work on collections so this can be done in batches?

Can you please provide your FASTA and describe what you mean by "much better"? I cannot see the image on Github.

stephenshank commented 1 year ago

There seem to be some minor issues with their tool wrapper, or with how I am integrating collections with it, but metaviralspades is now running on this dataset in the same history. Edit: It appears to have thrown an out-of-memory error.

stephenshank commented 1 year ago

@steph-sing @JingyueWu Suggestions for either an appropriate header line (instead of reassembled), or the right person to ask about this, are appreciated.

steph-sing commented 1 year ago

@stephenshank sorry I did not see this comment for some reason. I just reset my email notifications on Friday so hopefully this issue won't happen again.

Can you confirm if this was reference guided or not? (I know we have talked about this a couple times, but I want to be sure before I give you the header)

stephenshank commented 1 year ago

@steph-sing For this one it's full de-novo (got lucky). I am further refining reference-guided and hybrid strategies as well.

steph-sing commented 1 year ago

@JingyueWu can you confirm if there is a BCO ID related to this fasta, or allocated to this fasta and pipeline? this will be the last step in determining the header. Thanks

stephenshank commented 1 year ago

FASTA and draft BCO.

Orientation of the plucked contig is not automatic in this BCO (it's an older BCO, and this has been fixed in subsequent work). Awaiting confirmation on an appropriate header. Uncovered some bugs with Galaxy BCOs. Usability domain and keywords now autopopulate from Galaxy workflow. Links to input, workflow, and output should be functional.

Feedback is much appreciated.

rajamazumder commented 1 year ago

This is still the Wuhan assembly, I am guessing. For now, just use a simple FASTA definition line like this: `>SRA11111111_Megahit assembly for Wuhan-Hu-1 complete genome`. We can start collecting them and then decide on a better plan.
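Applying a definition line like the one suggested above to a single-record FASTA could look like the following minimal sketch. The function name `set_definition_line` is hypothetical, `SRA11111111` is the placeholder accession from the comment rather than a real one, and the sequence in the example is a toy stand-in.

```python
def set_definition_line(fasta_text: str, new_header: str) -> str:
    """Replace the definition line of a single-record FASTA string."""
    lines = fasta_text.splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not start with a FASTA definition line")
    if sum(1 for line in lines if line.startswith(">")) != 1:
        raise ValueError("expected exactly one record")
    # Accept the new header with or without a leading '>'.
    lines[0] = ">" + new_header.lstrip(">")
    return "\n".join(lines) + "\n"


# Toy record; the header text follows the placeholder pattern from the comment.
original = ">reassembled\nACGT\nAC\n"
updated = set_definition_line(
    original, "SRA11111111_Megahit assembly for Wuhan-Hu-1 complete genome"
)
```

Refusing multi-record input is a deliberate choice here, since the reassembly is expected to contain only the plucked contig.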

On Wed, Nov 9, 2022 at 6:36 PM Stephen Shank @.***> wrote:

FASTA https://data.hyphy.org/web/argos/wuhan-reassembly/reassembly.fasta and draft BCO https://biocomputeobject.org/ARGOS_000041/DRAFT.

Orientation of the plucked contig is not automatic in this BCO (it's an older BCO, and this has been fixed in subsequent work). Awaiting confirmation on an appropriate header. Uncovered some bugs with Galaxy BCOs. Usability domain and keywords now autopopulate from Galaxy workflow. Links to input, workflow, and output should be functional.

Feedback is much appreciated.

JingyueWu commented 1 year ago

@HadleyKing

1) Can you please reassign the prefix of this BCO from "ARGOS" to "ARG"? As per our discussion, all assemblies need to be created under the "ARG" prefix, in lieu of "ARGOS".
2) Alternatively, if you can't fix the above, we can potentially use the empty "ARG_" to populate with Stephen's BCO for Wuhan.

Please let me know, technically both could work but we need to go with the fastest, most correct, and most efficient approach. Thanks

HadleyKing commented 1 year ago

Done. ARG_000001/DRAFT is now identical to ARGOS_000041/DRAFT.

steph-sing commented 1 year ago

@JingyueWu it looks like @stephenshank attached the Wuhan FASTA about a week ago (above in the comments), and it seems like you have everything you need for the BCO? If so, please complete the following:

- [ ] Review the FASTA file and make sure the header is correct.
- [ ] Review the BCO and confirm all materials you need are included in the BCO. If not, please reach out to the appropriate party to complete this.
- [ ] Assign me to review the BCO so I can publish it.
- [ ] Send Emily/Joe Stephen's FASTA file with cc to me, so they can QC it. Provide the ID to them (ask me if you don't know what I mean by this). These data will go in our assemblyQC_HL table.
- [ ] @stephenshank include the metrics gathered during your assembly in your assemblyQC_PL dataset per the V1.0 data dictionary, and send us an updated table (this includes metrics from generating the Ebola Sudan FASTA file as well).

JingyueWu commented 1 year ago

This task is completed

steph-sing commented 1 year ago

Thank you for the update!

steph-sing commented 1 year ago

Status update: need to fix BCO issue before pushing.

HadleyKing commented 1 year ago

https://biocomputeobject.org/objects/view/https/biocomputeobject.org/ARG_000001/23.1 This should be done now.

steph-sing commented 1 year ago

Status: BCO issue is currently being fixed by @HadleyKing - still unable to get the fasta to the server

HadleyKing commented 1 year ago

The BCO Linked above is published and should be ready to finish processing. I moved the status of this ticket to in review on the project board. @steph-sing please review

steph-sing commented 1 year ago

I will test the push today with Jonathon.

steph-sing commented 1 year ago

@HadleyKing I am still getting errors with this BioCompute Object (https://www.biocomputeobject.org/objects/view/https/biocomputeobject.org/ARG_000001/23.2) and I am unable to push this card to the server. The errors are as follows:

Errors-I

"bco_id","property_path","error","error_category"
"ARG_000001","extension_domain.0.dataset_extension","missing required property path","FATAL"
"ARG_000001","io_domain.output_subdomain.0.uri.filename","file extension 'ObservableHQ on data 43 and data 39' not allowed","FATAL"

Please address the errors with this object.
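For anyone triaging reports like the one above, here is a minimal sketch that reads the report as CSV and keeps only the fatal rows. It assumes the report is ordinary quoted CSV with one error per row (the layout in this comment may have been flattened when pasted); the function name `fatal_errors` is hypothetical, and the sample text is copied from the report quoted above.

```python
import csv
import io
from typing import Dict, List

REPORT = '''"bco_id","property_path","error","error_category"
"ARG_000001","extension_domain.0.dataset_extension","missing required property path","FATAL"
"ARG_000001","io_domain.output_subdomain.0.uri.filename","file extension 'ObservableHQ on data 43 and data 39' not allowed","FATAL"
'''


def fatal_errors(report_text: str) -> List[Dict[str, str]]:
    """Return the rows of a validation report whose category is FATAL."""
    reader = csv.DictReader(io.StringIO(report_text))
    return [row for row in reader if row["error_category"] == "FATAL"]


# Print one line per fatal error: where it is and what went wrong.
for row in fatal_errors(REPORT):
    print(f'{row["bco_id"]}: {row["property_path"]} -> {row["error"]}')
```

Grouping the output by `property_path` makes it clear that the two errors point at different parts of the object: the extension domain and the output filename.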

HadleyKing commented 1 year ago

I rearranged the problematic parts.

steph-sing commented 1 year ago

New Error:

Errors-I

"bco_id","property_path","error","error_category"
"ARG_000001","io_domain.output_subdomain.0.uri.filename","file extension 'NC_045512_SARS-CoV-2_Wuhan' not allowed","FATAL"

I'm unsure what about this file name would deem it a fatal error? never seen this error before

HadleyKing commented 1 year ago

The filename field was missing an extension. I added .fasta. Try again please
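One plausible reading of why the whole filename showed up as the "file extension" is that the check takes whatever follows the last dot, so a name with no dot is treated entirely as the extension. The sketch below illustrates that behavior only. It is not the actual validator, whose code is not shown in this thread, and the `ALLOWED_EXTENSIONS` set and function names are hypothetical.

```python
from typing import Optional

# Hypothetical allow-list; the real validator's list is not shown in this thread.
ALLOWED_EXTENSIONS = {"fasta", "json", "csv", "tsv"}


def extension_of(filename: str) -> str:
    """Return the text after the last dot, or the whole name if there is no dot."""
    return filename.rsplit(".", 1)[-1]


def check_filename(filename: str) -> Optional[str]:
    """Return an error message if the extension is not allowed, else None."""
    ext = extension_of(filename)
    if ext.lower() not in ALLOWED_EXTENSIONS:
        return f"file extension '{ext}' not allowed"
    return None


before = check_filename("NC_045512_SARS-CoV-2_Wuhan")
after = check_filename("NC_045512_SARS-CoV-2_Wuhan.fasta")
```

Under this assumption, `before` reproduces the wording of the reported error and `after` is `None`, which matches the fix of appending `.fasta`.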

steph-sing commented 1 year ago

:( well that was dumb. but it worked. thank you!

steph-sing commented 1 year ago

Jonathon will aim to push to the DB tonight

steph-sing commented 1 year ago

We are getting errors in the server and have been unable to push this. @kee007ney is looking into this today, and will assign Robel a ticket as needed.

steph-sing commented 1 year ago

We have attempted to push this 3 times, all resulting in errors. We will aim to include it in tomorrow's V1.41 data push - hopefully with success. Will update by COB Dec 21.

steph-sing commented 1 year ago

Passed all BCO checks and will be included in the V1.41 push. Considered complete for Dec 2022.