ablab / VerityMap

GNU General Public License v3.0

docs/readme: output files #17

Closed ptrebert closed 2 years ago

ptrebert commented 2 years ago

Hi, I just used veritymap (i.e., repo checkout from today) on a small test dataset, and I cannot fully match the generated output to what is described in the readme; can you clarify what is what:

1. ptg000001l_kmers_dist_diff.html
2. test-asm.bed
3. testasm_errors.bed
4. testasm_kmers_dist_diff.bed
5. testasm_reads_dist_diff.txt
6. test-asm.sam
7. test-asm.txt

1 - ok; interactive report
2 - ok; alignments in BED format
3 - ? has BED extension, but I don't recognize that as a BED file
4 - somewhat explained in the readme, but I can't make sense of the "length" (which is always negative) and of the "frequency" column (integer); can you clarify?
5 - presumably the list of reads discordant with the assembly in the respective region
6 - ok; SAM alignments
7 - some sort of chain file

Thanks!

+Peter

seryrzu commented 2 years ago

Hi Peter,

You are right that some files are poorly described (or not described at all) in the README. Most of these you can ignore as they are rather technical and are mostly for debugging.

3 - testasm_errors.bed: this is a summary of putative misassemblies that is not particularly helpful for you. Please ignore it; I have removed its export from main.

4 - testasm_kmers_dist_diff.bed: the length is negative because there is a putative deletion from the reference. If it were an insertion, the length would be positive. Typically misassemblies look like deletions rather than insertions, but this is not always true. The format is {ref_name} {start} {end} {mism_len} {% discordant reads}. Please do not pay much attention to the last value, and use the html to see % discordant reads.

5 - correct

7 - yes, this is the chain file for alignments. Technical output.

I updated the Readme to reflect these notes.
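For anyone scripting against `*_kmers_dist_diff.bed`, the column layout given above can be read with a few lines of Python. This is only a sketch under assumptions: it assumes the five fields are whitespace-separated and appear in the order stated in this thread, and the names `DistDiffRecord` and `parse_dist_diff` are made up for illustration and are not part of VerityMap.

```python
from typing import Iterable, List, NamedTuple


class DistDiffRecord(NamedTuple):
    """One line of a *_kmers_dist_diff.bed file (hypothetical helper type)."""
    ref_name: str
    start: int
    end: int
    mism_len: int      # negative: putative deletion; positive: putative insertion
    discordant: float  # last column; the thread advises relying on the HTML instead

    @property
    def kind(self) -> str:
        if self.mism_len < 0:
            return "deletion"
        if self.mism_len > 0:
            return "insertion"
        return "none"


def parse_dist_diff(lines: Iterable[str]) -> List[DistDiffRecord]:
    """Parse lines assumed to hold: ref_name start end mism_len discordant."""
    records = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and anything with too few columns.
        if len(fields) < 5 or line.startswith("#"):
            continue
        ref, start, end, mism_len, disc = fields[:5]
        records.append(
            DistDiffRecord(ref, int(start), int(end), int(mism_len), float(disc))
        )
    return records
```

Typical use would be `parse_dist_diff(open("testasm_kmers_dist_diff.bed"))`, then filtering on `record.kind`. If the real file uses a different delimiter or carries extra columns, the split logic would need adjusting.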

ptrebert commented 2 years ago

Thanks a lot, that is very helpful. One more question: I have one sample that was successfully processed by VerityMap (at least no error reported), but the .sam output is empty (apart from the header). Is that as intended assuming no problems were detected, or rather cause for concern (i.e., the .sam output should always be non-empty)?

seryrzu commented 2 years ago

Hmm, the SAM file should be non-empty. It seems like no reads were aligned to the assembly. Could you share the data with me via Slack or email so I can take a look? My first suggestion is to check the k-mer index (it should be saved as a TSV). If it is not empty, then I would check the file with chains. Is your assembly haplotype-resolved (diploid etc.)?
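A header-only SAM file is easy to detect in a pipeline, since SAM header lines start with `@` and every other non-empty line is an alignment record. The sketch below is a generic check rather than anything VerityMap provides, and the function name `count_sam_records` is hypothetical.

```python
from typing import Iterable


def count_sam_records(lines: Iterable[str]) -> int:
    """Count alignment records: non-empty lines that do not start with '@'."""
    return sum(1 for line in lines if line.strip() and not line.startswith("@"))
```

For example, `with open("test-asm.sam") as fh: n = count_sam_records(fh)`; a result of 0 means the file contains only the header, which according to the reply above suggests that no reads were aligned.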

ptrebert commented 2 years ago

I can share the data later today (Globus would be preferred for sharing reads etc.). The assembly is a single (haploid) chromosome; could that be a problem? The same goes for the input reads. The k-mer index is non-empty, but I have only just realized that no chain file has been dumped.

seryrzu commented 2 years ago

That's strange — a haploid chromosome should not be a problem. Please share the globus link when convenient and I will take a look. Thanks!

ptrebert commented 2 years ago

Sorry, there is a delay; I have restarted the pipeline, and the reads have not yet been dumped again for the problematic sample. I'll ping you here when the data are available...

ptrebert commented 2 years ago

Just to close this: I investigated the empty output problem (sorry, I should have reported that as a separate issue) and discovered an error in the input file. Fixing that resulted in non-empty output. Thanks for updating the readme info about the output files.

One last comment or feature request about the output: it would be nice if the information in the HTML output were also directly available as a machine-readable text file or table, for easier aggregation when processing many samples on a cluster.