Is shasta suitable for tumor sample with complex SV? - Githubissues

chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads


Is shasta suitable for tumor sample with complex SV? #266

Closed charliechen912ilovbash closed 2 years ago

charliechen912ilovbash commented 3 years ago

Hi, I am wondering whether I can use Shasta to assemble reads from a tumor sample that contains complex/nested SVs such as deletions and inversions. Also, do you recommend performing assembly before or after phasing? By "after phasing" I mean separating the reads into H1 and H2 and then assembling them separately to obtain an H1 assembly and an H2 assembly. Thank you very much!

paoloczi commented 3 years ago

To my knowledge, Shasta has not yet been used for tumor samples. However, I have every reason to expect that, with sufficient coverage, it should be able to assemble the kind of structural variants you are interested in. That said, the sample is going to be highly heterogeneous due to the presence of multiple cell populations, each with its own specific variants. Because of that, I would not suggest phasing the reads under the usual diploid assumptions. Instead, I suggest an assembly using all the reads, but reducing the amount of bubble removal that Shasta normally does to create haploid assemblies.

Specifically, I suggest running the assembly with the following options:

shasta --input ... --config Nanopore-Sep2020.conf --MarkerGraph.simplifyMaxLength 10

See here for a description of how that last option affects bubble removal in Shasta. If you don't have access to the Nanopore-Sep2020.conf configuration file, you can find it in the shasta/conf directory in the tar file for release 0.7.0, or simply download it from GitHub here.

Because of heterogeneity in the sample, most structural variants will be heterozygous. The assembly will consist of short linear assembled segments ("contigs") intermixed with "bubbles". The two (or more) branches of each bubble will contain the two or more sequences present in the data. I suggest using Bandage to inspect the gfa output of the assembly. Extracting a more explicit description of the het variants will require some processing of the gfa file.
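As a starting point for that GFA processing, here is a minimal Python sketch that lists the simplest kind of bubble: two or more single-segment branches sharing one source and one sink. It assumes standard GFA1 `L` lines (from, orientation, to, orientation, overlap). The function names and the toy graph are hypothetical, not part of Shasta, and real assemblies will also contain nested or multi-segment bubbles that this does not handle.

```python
from collections import defaultdict

FLIP = {"+": "-", "-": "+"}

def parse_gfa_links(gfa_text):
    """Build successor/predecessor maps over oriented segments (name, orientation)."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "L" or len(fields) < 5:
            continue  # only link lines are needed here
        a = (fields[1], fields[2])
        b = (fields[3], fields[4])
        # Each link a -> b also implies the reverse-complement link b' -> a'.
        reverse = ((b[0], FLIP[b[1]]), (a[0], FLIP[a[1]]))
        for u, v in ((a, b), reverse):
            succ[u].add(v)
            pred[v].add(u)
    return succ, pred

def simple_bubbles(succ, pred):
    """Return (source, branches, sink) for bubbles whose branches are single segments."""
    bubbles = []
    seen = set()  # a bubble shows up in both orientations; report it once
    for source, branches in list(succ.items()):
        if len(branches) < 2:
            continue
        key = frozenset(name for name, _ in branches)
        if key in seen:
            continue
        sinks = set()
        is_simple = True
        for branch in branches:
            # A branch must be entered only from the source and exit to one place.
            if pred.get(branch, set()) != {source} or len(succ.get(branch, set())) != 1:
                is_simple = False
                break
            sinks |= succ[branch]
        if is_simple and len(sinks) == 1:
            seen.add(key)
            bubbles.append((source, sorted(branches), next(iter(sinks))))
    return bubbles

# Toy graph (made up for illustration): A splits into B and C, which rejoin at D.
toy_gfa = "\n".join([
    "H\tVN:Z:1.0",
    "S\tA\tACGT", "S\tB\tA", "S\tC\tG", "S\tD\tTTGA",
    "L\tA\t+\tB\t+\t0M", "L\tA\t+\tC\t+\t0M",
    "L\tB\t+\tD\t+\t0M", "L\tC\t+\tD\t+\t0M",
])
succ, pred = parse_gfa_links(toy_gfa)
for source, branches, sink in simple_bubbles(succ, pred):
    print(source, branches, sink)
```

Comparing the sequences of the branch segments (from the `S` lines) would then give a first description of each candidate het variant.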

Once you have an assembly, feel free to post here for additional discussion, and I may be able to suggest additional steps.

paoloczi commented 2 years ago

I am closing this due to lack of discussion. Feel free to reopen it or create a new issue as needed.

charliechen912ilovbash commented 2 years ago

Thanks very much for the valuable, detailed advice! Sorry that I was too busy recently to reply to this comment. Currently I am resolving tumor SVs with the workflow: mapping > (SV detection) > phasing. This seems to work well for non-complex SVs. For complex SVs, which contain multiple types of SV located very close together (<200 bp), the situation becomes more complicated. I have one question about the haploid assemblies that you mentioned Shasta creates. Are the "haploid assemblies" equivalent to the phasing result, haplotype 1 and haplotype 2? Or is it a collapsed assembly, where the parental alleles switch randomly?

paoloczi commented 2 years ago

In Shasta haploid assemblies, at each heterozygous locus one of the two alleles is chosen (typically the one with the most coverage). This happens independently of nearby heterozygous loci, and as a result we get a collapsed assembly, where parental alleles are semi-randomly switching from one locus to the next.
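To make the switching concrete, here is a toy Python illustration of choosing, at each heterozygous locus independently, the allele with the higher coverage. This is only a sketch of the idea described above, not Shasta's actual code: the coverage numbers are invented, and breaking ties in favor of H1 is an arbitrary assumption.

```python
def collapse(loci):
    """Pick one allele per heterozygous locus, independently of its neighbors.

    loci: list of (h1_coverage, h2_coverage) pairs, one per locus.
    Returns the parental haplotype kept at each locus.
    """
    # Ties go to H1 here; that choice is arbitrary and only for illustration.
    return ["H1" if h1_cov >= h2_cov else "H2" for h1_cov, h2_cov in loci]

def count_switches(choices):
    """Count how often the kept haplotype changes between consecutive loci."""
    return sum(1 for a, b in zip(choices, choices[1:]) if a != b)

# Invented coverages at five consecutive heterozygous loci.
loci = [(18, 15), (14, 17), (16, 16), (12, 19), (20, 13)]
choices = collapse(loci)
print(choices)                  # ['H1', 'H2', 'H1', 'H2', 'H1']
print(count_switches(choices))  # 4
```

Because coverage fluctuates from locus to locus, the kept allele flips between parents, which is why the collapsed sequence is not a phased haplotype.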

As I suggested in my previous comment, you can also turn off bubble removal (at least for the larger bubbles) to keep both alleles of each heterozygous locus. That could result in better sensitivity for heterozygous SVs than just using the collapsed assembly.

charliechen912ilovbash commented 2 years ago

I see, thank you for the explanation. I will try the command you suggested. So first I can use Shasta to assemble all reads, producing contigs that include both heterozygous alleles, map these contigs to the reference, and then detect SVs and perform phasing with the mapped reads? If so, what is the recommended assembly polisher to use after assembly? I read the Nanopore assembly manual, which recommended medaka and racon. But in the Shasta paper, HELEN also seems to be a good choice, although it was tested with normal samples. Another question: does it make sense to assess a tumor assembly with tools like BUSCO? I thought a tumor genome might have a different composition compared to a normal genome. Sorry that I am new to assembly, so some questions may be basic. Thank you very much.

paoloczi commented 2 years ago

The procedure you have in mind sounds reasonable to me, except for the fact that, as I said in my first comment, I have doubts about doing phasing with a diploid assumption in a tumor genome where several cell populations are present.

Your other questions are outside my domain of competence, but I will ask some of my colleagues to contribute to this discussion.

charliechen912ilovbash commented 2 years ago

I see, thanks for the suggestion. Currently I am using germline SNPs called on the normal sample to phase the reads of that sample and of the paired tumor sample, as some long-read SV studies have done. Maybe this can mitigate the high heterogeneity issue in tumors to some degree.
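For readers unfamiliar with this approach, the read-assignment step can be sketched in Python as a simple vote over phased germline SNPs. This is a toy illustration under stated assumptions, not the method of any particular phasing tool: the function name, the `min_margin` parameter, and the example positions and bases are all hypothetical.

```python
def assign_haplotype(read_alleles, phased_snps, min_margin=1):
    """Assign a read to H1 or H2 by counting matches at phased germline SNPs.

    read_alleles: {position: base observed in the read}
    phased_snps:  {position: (h1_base, h2_base)} from the normal sample
    min_margin:   minimum vote difference required to make a call
    """
    h1_votes = h2_votes = 0
    for pos, base in read_alleles.items():
        if pos not in phased_snps:
            continue  # not a phased germline site
        h1_base, h2_base = phased_snps[pos]
        if base == h1_base:
            h1_votes += 1
        elif base == h2_base:
            h2_votes += 1
        # Any other base (sequencing error, somatic change) casts no vote.
    if h1_votes - h2_votes >= min_margin:
        return "H1"
    if h2_votes - h1_votes >= min_margin:
        return "H2"
    return "unassigned"

# Invented example: three phased SNPs, one tumor read covering all of them.
phased_snps = {100: ("A", "G"), 250: ("C", "T"), 400: ("G", "A")}
tumor_read = {100: "A", 250: "C", 400: "A"}
print(assign_haplotype(tumor_read, phased_snps))  # H1 (two votes against one)
```

Since the votes come only from germline sites, somatic variants in the tumor reads do not affect the assignment directly, but reads that cover no phased SNP stay unassigned.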

paoloczi commented 2 years ago

I think this will work, except in regions with large SVs.

paoloczi commented 2 years ago

I am closing this due to lack of discussion. Feel free to reopen it or create a new issue as needed.