If there is a way I can know which raw reads go to a specific contig?

WenyuLiang commented 2 years ago

Hi! Is there a way to find out which raw reads went into a specific contig?

paoloczi commented 2 years ago

For haploid assembly, you can do that using the following command line option:

--Assembly.writeReadsByAssembledSegment

I have not tested this option in some time, so if you bump into problems please post here and I will look into it.

For diploid assembly, this functionality is not available.

paoloczi commented 2 years ago

A bit more information on that option. If you turn it on, the assembly directory will contain a csv file named ReadsBySegment.csv. The top of the file looks like this:

The meaning of the columns is as follows:

AssembledSegmentId identifies an assembled segment (same identifier used in other assembly output such as Assembly.fasta).
EdgeCount is the length of that assembled segment (number of edges) in the marker graph.
OrientedReadCount is the number of oriented reads that were used to assemble the segment. An oriented read is a read in either its original orientation or reverse complemented.
OrientedReadId is the Shasta internal id of a read that was used to assemble the segment. It uses the format ReadId-Strand, where Strand can be 0 (original orientation) or 1 (reverse complemented). So, for example, 66-1 means read 66, reverse complemented. To convert the Shasta internal ReadId to the read name in the input fasta/fastq files, you can use the first two columns of ReadSummary.csv.
VertexCount and EdgeCount are the number of marker graph vertices and edges, respectively, that the given oriented read appears on, out of the vertices and edges that make up the assembled segment.
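For anyone who wants to go from a segment to the original read names, here is a minimal Python sketch of the lookup described above. It is not part of Shasta, and it rests on several assumptions that should be checked against your own files: that ReadsBySegment.csv has a header row and one row per (segment, oriented read) pair, that the columns appear in the order listed above (they are read by position because the name EdgeCount occurs twice), and that the first two columns of ReadSummary.csv are the ReadId and the read name. The function names are made up for illustration.

```python
import csv
from collections import defaultdict


def parse_oriented_read_id(text):
    """Split a 'ReadId-Strand' string such as '66-1' into (66, 1)."""
    read_id, strand = text.strip().rsplit("-", 1)
    return int(read_id), int(strand)


def load_read_names(read_summary_lines):
    """Map Shasta ReadId -> read name from the first two columns of ReadSummary.csv."""
    rows = csv.reader(read_summary_lines)
    next(rows, None)  # assumption: the first row is a header
    return {int(row[0]): row[1] for row in rows if len(row) >= 2}


def reads_by_segment(reads_by_segment_lines, segment_column=0, oriented_read_column=3):
    """Group oriented reads by assembled segment id.

    Assumption: one row per (segment, oriented read), with AssembledSegmentId
    in column 0 and OrientedReadId in column 3. Adjust the column indices if
    your file is laid out differently.
    """
    rows = csv.reader(reads_by_segment_lines)
    next(rows, None)  # assumption: the first row is a header
    segments = defaultdict(list)
    for row in rows:
        if len(row) <= oriented_read_column or not row[oriented_read_column]:
            continue  # skip blank or short rows
        oriented_read = parse_oriented_read_id(row[oriented_read_column])
        segments[row[segment_column]].append(oriented_read)
    return segments


def read_names_for_segment(segment_id, segments, names):
    """Return the sorted, de-duplicated read names for one segment.

    Reads missing from ReadSummary.csv fall back to their numeric id.
    """
    oriented_reads = segments.get(str(segment_id), [])
    return sorted({names.get(read_id, str(read_id)) for read_id, _strand in oriented_reads})
```

To use it, open both files from the assembly directory with `open(path, newline="")`, pass the file objects to `load_read_names` and `reads_by_segment`, then call `read_names_for_segment` with the segment id you are interested in. The strand is kept in the grouped data in case you need to know which reads were reverse complemented.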

WenyuLiang commented 2 years ago

Thank you so much!!!

paoloczi commented 2 years ago

I am closing this due to lack of additional discussion. If other questions emerge, feel free to open another issue.

chanzuckerberg / shasta

If there is a way I can know which raw reads go to a specific contig? #294