Clinical-Genomics / scout

VCF visualization interface
https://clinical-genomics.github.io/scout
BSD 3-Clause "New" or "Revised" License
152 stars 46 forks

Assembly tracks in IGV #4982

Open fellen31 opened 1 month ago

fellen31 commented 1 month ago

Hi,

What do you think about the possibility of viewing the mapped assembly as tracks in IGV? They could be shown next to the reads if the files are included in the scout config.

Currently Nallo outputs each haplotype into a separate BAM file.

dnil commented 1 month ago

Hi, could you link a corresponding PR/issue on Nallo or the tool you use to produce these to make it easier to follow?

It sounds like you do a de novo assembly into two separate bam files. Are these with their own consensus, or already aligned back to a reference genome? If the latter, this should be relatively straightforward, though it is a little unclear why we don't just use haplotype tagging in the joint cram instead? That was what we had previously, right?

fellen31 commented 1 month ago

> If the latter, this should be relatively straightforward, though it is a little unclear why we don't just use haplotype tagging in the joint cram instead?

Yes, within Nallo hifiasm produces a dual de novo assembly (hap1 and hap2), which is converted to two FASTA files. These are then aligned back to the reference and saved as BAM files. If it suits scout better, I could look into adding tags to the contigs and joining the BAM files.
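To make the tagging-and-joining idea concrete, here is a minimal sketch working on plain SAM text (for example as streamed by `samtools view`). It is not what Nallo does today; the function names `add_hp_tag` and `tag_and_join` are hypothetical, and the only thing taken from the SAM tag conventions is that `HP` is an integer haplotype tag.

```python
from typing import Iterable, Iterator


def add_hp_tag(sam_line: str, haplotype: int) -> str:
    """Return the SAM line with its HP tag set to `haplotype`.

    Header lines (starting with '@') and empty lines are returned unchanged.
    """
    line = sam_line.rstrip("\n")
    if not line or line.startswith("@"):
        return line
    fields = line.split("\t")
    # The first 11 columns are mandatory; everything after is optional tags.
    mandatory, optional = fields[:11], fields[11:]
    # Drop any existing HP tag so the record carries exactly one.
    optional = [tag for tag in optional if not tag.startswith("HP:")]
    optional.append(f"HP:i:{haplotype}")
    return "\t".join(mandatory + optional)


def tag_and_join(
    hap1_records: Iterable[str], hap2_records: Iterable[str]
) -> Iterator[str]:
    """Yield alignment records from both haplotypes, tagged HP:i:1 and HP:i:2.

    Header lines are skipped here; a single header has to be supplied
    separately when the joined records are written out.
    """
    for haplotype, records in ((1, hap1_records), (2, hap2_records)):
        for record in records:
            if record.strip() and not record.startswith("@"):
                yield add_hp_tag(record, haplotype)
```

The joined records would still need a header and coordinate sorting (e.g. with `samtools sort`) before they can be indexed and loaded as one track.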

> That was what we had previously, right?

I don't think so :)

fellen31 commented 1 month ago

> Hi, could you link a corresponding PR/issue on Nallo or the tool you use to produce these to make it easier to follow?

I created an issue, if you think it's better discussed there.

dnil commented 1 month ago

> > That was what we had previously, right?
>
> I don't think so :)

We have looked at WhatsHap haplotype tags in another issue, on a test case loaded from Nallo, so I was under the impression that this was the state of the art a little while back?

fellen31 commented 1 month ago

For aligned reads, yes, reads come in as a single file, are tagged with the help of phased variants and output as a single file. But for the aligned assemblies, as they are output as two separate files from the beginning, there has not yet been a need to tag them.

dnil commented 1 month ago

Right, got that part, I mean it is very similar data? We had the HP info from WhatsHap before.

Like so: additional tracks can certainly be arranged - you will need to register output types with Hermes, pass them to Housekeeper, export them through CG. Given the key names we can add them to the load model, and to track generation. All good. Tags would be immediately available.

But the real question is perhaps in the quality of the data? If de novo data remapped to reference is better overall than the reference guided/mapped ones perhaps we should just replace the cram in the first place?

dnil commented 1 month ago

Oh, does the hifiasm output include the underlying reads as well, or is it "only" the haplotype consensus?

fellen31 commented 1 month ago

> Right, got that part, I mean it is very similar data? We had the HP info from WhatsHap before.

Yes!

> Like so: additional tracks can certainly be arranged - you will need to register output types with Hermes, pass them to Housekeeper, export them through CG. Given the key names we can add them to the load model, and to track generation. All good. Tags would be immediately available.

Sounds good! I'll hack together a test-case with HP-tags to upload instead of the normal bam, so we can see what it looks like.
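For what the viewing side of such a test case might involve, here is a rough sketch of an igv.js alignment track entry that colors records by the `HP` tag. This is not scout's actual track generation: the helper name `haplotype_track` and the file names are made up, and the option names (`colorBy`, `colorByTag`) are how I recall the igv.js alignment track options, so they should be checked against the igv.js version in use.

```python
def haplotype_track(name: str, url: str, index_url: str) -> dict:
    """Build an igv.js-style alignment track config colored by the HP tag.

    Hypothetical helper; the option names are assumptions to verify
    against the igv.js documentation.
    """
    return {
        "name": name,
        "type": "alignment",
        "format": "bam",
        "url": url,
        "indexURL": index_url,
        # Color each record by the value of its HP (haplotype) tag.
        "colorBy": "tag",
        "colorByTag": "HP",
    }


# Example with made-up file names:
track = haplotype_track(
    "Assembly (hap1 + hap2)",
    "sample_assembly.bam",
    "sample_assembly.bam.bai",
)
```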

> But the real question is perhaps in the quality of the data? If de novo data remapped to reference is better overall than the reference guided/mapped ones perhaps we should just replace the cram in the first place?

I don't think we can do that switch (yet). I think in some regions the assembly will be better, in others the reads will be more informative. But I believe they are complementary, and might help/serve as a control if visually inspecting variants.

> Oh, does the hifiasm output include the underlying reads as well, or is it "only" the haplotype consensus?

No, it's only the consensus sequence.

dnil commented 1 month ago

Sounds good! My thought is that it is not super useful to have only the consensus in isolation from the assembly graph, reads or some sort of quality scores. Some regions will of course be super solid, whereas others are separated by regions of more tentative scaffolding; consensus calling will likewise be distinct for the more unique parts of the genome and a bit fuzzier for the more difficult parts. As good a combo as we can get without too much effort is good: of course it helps to have the de novo consensus tracks as one more source of evidence when, e.g., considering whether two variants are on the same allele or not.

fellen31 commented 4 weeks ago

Uploading in one file with HP-tags seems to work well, and it would be a fairly minor addition to Nallo. Let me know if this is what you would prefer.

[Image: Screenshot 2024-10-28 at 14 42 46]