Create ranked list of BioProjects for Co-Assembly

taltman commented 3 years ago

Some SRA samples have tantalizing signal that they contain a novel CoV, but there's insufficient read coverage in a single SRA run to assemble a complete genome sequence. For some of these fragmented assemblies, we may be able to perform co-assembly with SRA runs from the same BioProject, so that we're more likely pooling reads from the same strain.

The criteria for this list are as follows (@rcedgar, @ababaian feel free to refine):

1. There is summarizer data indicating that there's a CoV sequence in the SRA
2. The current assembly has more than one contig, and/or is shorter than expected
3. The reads have <97% AA identity to the closest known genome
4. The SRA coverage to the nearest neighbor is low
5. Querying NCBI, there are additional SRAs from the same BioProject that can be pooled

Steps 1, 3, & 4 can be done using Tantalus SQL queries. Step 2 can be pulled from the 'master table' that @rchikhi is updating. This then can be joined with step 5 (querying NCBI) to come up with a final table for prioritizing co-assembly.
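As a rough illustration of how the final table could be put together once the pieces exist, here is a minimal sketch. Every field name, the 25 kb length cutoff, and the 0.5 coverage cutoff are placeholders of mine, not values from Tantalus or the master table; only the 97% AA identity threshold comes from the criteria above.

```python
from dataclasses import dataclass

# All field names and thresholds below are hypothetical placeholders.
EXPECTED_LEN = 25_000   # assumed cutoff for "shorter than expected", in nt
MAX_AA_IDENTITY = 97.0  # criterion 3 from the list above
MAX_COVERAGE = 0.5      # assumed meaning of "low" coverage (fraction covered)


@dataclass
class Run:
    accession: str
    bioproject: str
    summarizer_hit: bool  # step 1
    n_contigs: int        # step 2
    assembly_len: int     # step 2
    aa_identity: float    # step 3, percent
    coverage: float       # step 4, fraction of nearest neighbor covered


def shortlist(runs, runs_by_bioproject):
    """Return (run, sibling accessions) pairs that satisfy all five criteria."""
    selected = []
    for run in runs:
        # Step 5: other runs in the same BioProject that could be pooled.
        siblings = [a for a in runs_by_bioproject.get(run.bioproject, [])
                    if a != run.accession]
        fragmented = run.n_contigs > 1 or run.assembly_len < EXPECTED_LEN
        if (run.summarizer_hit and fragmented
                and run.aa_identity < MAX_AA_IDENTITY
                and run.coverage < MAX_COVERAGE
                and siblings):
            selected.append((run, siblings))
    return selected


# Made-up accessions, for illustration only.
example_runs = [
    Run("RUN_A", "PRJ_1", True, 3, 11_000, 82.0, 0.3),
    Run("RUN_B", "PRJ_1", True, 1, 29_000, 99.1, 0.9),
    Run("RUN_C", "PRJ_2", True, 2, 9_000, 90.0, 0.2),
]
example_projects = {"PRJ_1": ["RUN_A", "RUN_B"], "PRJ_2": ["RUN_C"]}
picked = shortlist(example_runs, example_projects)
```

In this toy input only RUN_A survives: RUN_B is a single long contig close to a known genome, and RUN_C has no other run in its BioProject to pool with.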

Then, I can hand it off to @McGlock for co-assembly.

taltman commented 3 years ago

@ababaian, are there folks working on Tantalus that could help with Steps 1, 3, and 4?

rcedgar commented 3 years ago

Suggest changing the title of the issue because it's not a given that candidates need to be in the same BioProject.

I propose using two complementary approaches to find candidates for co-assembly.

(A) Identify fragmented assemblies that are potentially the same virus. This can be done from Rayan's master table by finding shorter contigs with similar nearest neighbor & identity. These do NOT need to be in the same BioProject -- the best check that they are the same virus is to align CS contigs and verify identity is ~100%. Suggest only samples with a positive detection by the summarizer need be included -- if a sample does not show up in the summary reports then the coverage is surely too low to be useful. Finding these candidates will require writing a script in Perl/Python/R to trawl the assembly master table.
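A minimal sketch of what the trawling script for (A) might look like, assuming the master table can be loaded as rows with `run`, `nearest_neighbor`, `identity` and `length` fields. Those names, the 25 kb length cutoff and the 1% identity tolerance are all my assumptions, not the real table schema. It only proposes groups; aligning the contigs is still the actual check.

```python
from collections import defaultdict


def group_fragmented(rows, max_len=25_000, identity_tol=1.0):
    """Group short assemblies that share a nearest neighbor and have similar identity.

    rows: dicts with hypothetical keys run, nearest_neighbor, identity, length.
    Within one nearest neighbor, rows are sorted by identity and chained together
    while consecutive identities differ by at most identity_tol percentage points.
    Only groups with at least two runs are returned.
    """
    by_neighbor = defaultdict(list)
    for row in rows:
        if row["length"] < max_len:  # keep only assemblies that look incomplete
            by_neighbor[row["nearest_neighbor"]].append(row)

    groups = []
    for neighbor, members in by_neighbor.items():
        members.sort(key=lambda r: r["identity"])
        current = [members[0]]
        for row in members[1:]:
            if row["identity"] - current[-1]["identity"] <= identity_tol:
                current.append(row)
            else:
                if len(current) > 1:
                    groups.append((neighbor, [r["run"] for r in current]))
                current = [row]
        if len(current) > 1:
            groups.append((neighbor, [r["run"] for r in current]))
    return groups


# Made-up rows, for illustration only.
example_rows = [
    {"run": "R1", "nearest_neighbor": "NN1", "identity": 88.0, "length": 5_000},
    {"run": "R2", "nearest_neighbor": "NN1", "identity": 88.6, "length": 7_000},
    {"run": "R3", "nearest_neighbor": "NN1", "identity": 95.0, "length": 6_000},
    {"run": "R4", "nearest_neighbor": "NN2", "identity": 88.2, "length": 4_000},
    {"run": "R5", "nearest_neighbor": "NN1", "identity": 88.3, "length": 29_000},
]
example_groups = group_fragmented(example_rows)
```

Note the grouping deliberately ignores BioProject, in line with the point that same-virus candidates need not come from the same project.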

(B) Identify low-coverage samples (possibly in the same BioProject) using summarizer reports. These may never have been assembly targets, so we would lose them by considering only assemblies. I don't think it's necessary that these should be in the same BioProject, though it makes the search easier to implement and increases the probability that the virus is the same. We might be able to find a few more assemblable viruses by casting a wider net; maybe try searching across multiple BioProjects if time allows after the easier cases are done. The implementation here is probably some combination of SQL and post-processing in a Perl/Python/R script.
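The post-processing half of (B) could be sketched roughly as below, assuming the SQL step returns one row per run with `run`, `bioproject`, `top_hit` and `coverage` fields. These names and the 0.5 cutoff are hypothetical. The `same_bioproject` switch is there to show the difference between the easy case and the wider net.

```python
from collections import defaultdict


def low_coverage_pools(reports, max_cov=0.5, min_runs=2, same_bioproject=True):
    """Pool low-coverage runs that share a top hit.

    reports: dicts with hypothetical keys run, bioproject, top_hit, coverage.
    With same_bioproject=True, runs are pooled only within one BioProject;
    with False, runs are pooled across all BioProjects (key uses "*").
    """
    pools = defaultdict(list)
    for rep in reports:
        if rep["coverage"] < max_cov:
            project = rep["bioproject"] if same_bioproject else "*"
            pools[(project, rep["top_hit"])].append(rep["run"])
    # A pool of one run gives nothing to co-assemble with.
    return {key: sorted(runs) for key, runs in pools.items() if len(runs) >= min_runs}


# Made-up reports, for illustration only.
example_reports = [
    {"run": "R1", "bioproject": "PRJ_1", "top_hit": "NN1", "coverage": 0.1},
    {"run": "R2", "bioproject": "PRJ_1", "top_hit": "NN1", "coverage": 0.2},
    {"run": "R3", "bioproject": "PRJ_1", "top_hit": "NN1", "coverage": 0.9},
    {"run": "R4", "bioproject": "PRJ_2", "top_hit": "NN1", "coverage": 0.1},
]
within = low_coverage_pools(example_reports)
across = low_coverage_pools(example_reports, same_bioproject=False)
```

Here the within-project search pools R1 and R2, while the cross-project search also picks up R4.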

rcedgar commented 3 years ago

Further thoughts: FWIW, I would guess there will be no more than a handful of good targets for co-assembly and it will be worth doing a manual review for each candidate. IMO it's not worth going to this much trouble to capture one more sub-strain of PEDV or bronchitis, hence discarding everything with >97% identity to a known genome, which I'm guessing will cut this down to a very short list. Conversely, it is worth a lot of trouble to find a new species, which is why I would be a bit reluctant to limit the search to samples where we already have a contig and/or to samples in the same BioProject.

ababaian commented 3 years ago

There is summarizer data indicating that there's a CoV sequence in the SRA

I wouldn't worry about things that did not make it past the first 'assembly' filter we defined. The starting point should be the ~55K we assembled that then ...

The current assembly has more than one contig, and/or is shorter than expected

Do not have a complete genome AND

The reads have <97% AA identity to the closest known genome

Are not just more of the same, i.e. repeats of what we already have. Essentially, start with the most distant members and work your way "up" from there. Expand this to include the "closest known GenBank sequence", be it a fragment, a genome, or whatever.

The SRA coverage to the nearest neighbor is low

I would say the above criteria are more important than whatever the coverage values are.
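One way to express that priority in a ranking, as a sketch only: sort on identity to the closest known sequence first (most distant on top) and use coverage purely as a tie-breaker. The field names are hypothetical, and putting lower coverage first within a tie is my own assumption.

```python
def rank_candidates(candidates):
    """Order candidates from most to least distant from known sequences.

    candidates: dicts with hypothetical keys run, identity (percent), coverage.
    Identity is the primary key; coverage only breaks ties (lower first).
    """
    return sorted(candidates, key=lambda c: (c["identity"], c["coverage"]))


# Made-up candidates, for illustration only.
example_candidates = [
    {"run": "R1", "identity": 95.0, "coverage": 0.1},
    {"run": "R2", "identity": 80.0, "coverage": 0.6},
    {"run": "R3", "identity": 80.0, "coverage": 0.2},
]
ranked = [c["run"] for c in rank_candidates(example_candidates)]
```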

Querying NCBI, there are additional SRAs from the same BioProject that can be pooled

I would caution against co-assembly across different BioProjects. It may be worth trying to 'stitch' together a novel genome if we think there are overlaps in different projects, and then go back to each individual BioProject and align all the contigs to those stitched genomes. Unfortunately I don't see a good way to automate this process at scale; this is trench warfare to get complete genomes one novel species at a time. The list of SRA runs in a BioProject can be retrieved via the SraRunInfo tables (see SRA-Queries), which unfortunately are not in the SQL database yet. With that list of SRA runs in hand, it's fairly trivial to pull the summary reports for all the SRA runs in that BioProject. Currently this only applies to the nt-reports; the aa-reports are not organized into a DB yet.

Edit: Robert actually has the shortlist of samples on this, just MSG him and he'll send it your way.
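For the SraRunInfo lookup described above, a minimal sketch of indexing runs by BioProject, assuming the table is available as CSV text with the `Run` and `BioProject` columns used in NCBI's RunInfo export. The accessions in the example are made up, and how the CSV is obtained is left out.

```python
import csv
import io
from collections import defaultdict


def runs_by_bioproject(runinfo_csv):
    """Map each BioProject to its runs, given SraRunInfo-style CSV text."""
    index = defaultdict(list)
    for row in csv.DictReader(io.StringIO(runinfo_csv)):
        # Skip blank or partial rows that lack either column.
        if row.get("Run") and row.get("BioProject"):
            index[row["BioProject"]].append(row["Run"])
    return dict(index)


def sibling_runs(index, run):
    """Return the other runs in the same BioProject as `run` (empty if none)."""
    for runs in index.values():
        if run in runs:
            return [r for r in runs if r != run]
    return []


# Made-up accessions, for illustration only.
EXAMPLE_RUNINFO = """Run,BioProject,LibraryStrategy
RUN_A,PRJ_1,RNA-Seq
RUN_B,PRJ_1,RNA-Seq
RUN_C,PRJ_2,RNA-Seq
"""
example_index = runs_by_bioproject(EXAMPLE_RUNINFO)
example_siblings = sibling_runs(example_index, "RUN_A")
```

The sibling list is what would then be used to pull the summary reports for the rest of the BioProject.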

asl commented 3 years ago

I did some checks previously:

For SRR5234495 we only have 1 sample in the project. There are 3 other SRAs there, but they are for embryo, not juvenile. Our assembly contains an isolated 16 kbp contig.
For SRR1324965 there is also SRR1324966. SRR1324965 has a ~11 kbp contig and SRR1324966 is quite empty.

asl commented 3 years ago

I tried to co-assemble SRR1324965 and SRR1324966. Note that the contig had quite decent coverage and SRR1324966 was empty. There was no improvement.

ababaian / serratus

Create ranked list of BioProjects for Co-Assembly #217