Closed by dirkjanvw 1 year ago
There is no size-dependent behaviour here that I can think of: if you pass in a sample via --reference, it should always end up as a surjectable REFERENCE-sense path in the final output, just like you are seeing for the yeast example.
I think a few versions ago, the multiple reference support may have been a bit buggy. So if you ran your larger data with an older version of cactus and yeast with the latest version, that would definitely explain it.
You might be able to see where it's going wrong in the log. Your reference genomes should appear in the hal2vg --reference parameters.
They should also get added to the GFA header with a sed command:
-e 1s/.*/H VN:Z:1.1 RS:Z:YPS128 S288C _MINIGRAPH_/"
Finally, you can take the output GFA file and add the genomes to the RS tag in the header by hand if you want. After that you can regenerate your GBZ file (look for the command in the log) and all paths will be REFERENCE -- it's really that header that controls it.
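For anyone who would rather script the header edit than do it by hand, here is a minimal Python sketch. It assumes the header is the first line of the GFA, that its fields are tab-separated as in the GFA specification, and that the RS tag holds space-separated sample names as in the sed fragment above. The function names and the sample names genomeB and genomeC are hypothetical. It only rewrites the header; regenerating the GBZ still needs the command from your own log.

```python
import gzip
import shutil


def add_reference_samples(header_line, samples):
    """Return a GFA header line whose RS tag also lists `samples`."""
    fields = header_line.rstrip("\r\n").split("\t")
    if fields[0] != "H":
        raise ValueError("not a GFA header line")

    # Find an existing RS tag, if there is one.
    rs_index = None
    for i, field in enumerate(fields):
        if field.startswith("RS:Z:"):
            rs_index = i
            break

    current = fields[rs_index][len("RS:Z:"):].split() if rs_index is not None else []

    # Keep the existing order and append only samples that are not listed yet.
    merged = list(current)
    for sample in samples:
        if sample not in merged:
            merged.append(sample)

    new_tag = "RS:Z:" + " ".join(merged)
    if rs_index is None:
        fields.append(new_tag)
    else:
        fields[rs_index] = new_tag
    return "\t".join(fields)


def rewrite_gfa_header(in_path, out_path, samples):
    """Copy a gzipped GFA, replacing only its first (header) line."""
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        first = fin.readline()
        fout.write(add_reference_samples(first, samples) + "\n")
        shutil.copyfileobj(fin, fout)  # stream the remaining lines unchanged
```

Something like `rewrite_gfa_header("large-pg.gfa.gz", "large-pg.fixed.gfa.gz", ["genomeB", "genomeC"])` would then write a copy with the extended header. The output filename is only an example.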
Thanks a lot for all the suggestions!
I double-checked the version, and the one I used for the large dataset was 2.6.7-gpu.
Diving into the log using your tips, I did find something strange. First off, the hal2vg commands looked fine. I see the exact same command (including all references specified) being run on a temporary hal file 10 times (probably corresponding to my 9 chromosomes + chromosome 0?). However, the sed command you mentioned is not identical every time it is run. For my chromosome 0 it only mentions my genomeA, whereas for the other 9 chromosomes it mentions all references. Also, the resulting GFA file (large-pg.gfa.gz) has this header: H VN:Z:1.1 RS:Z:genomeA, without the other references. Could it be that the pipeline takes the smallest set of references it can find for all chromosomes? By the way, I did double-check all input fasta files and all of them have a chromosome 0 (but I would agree that I should have left out this chromosome 0 in the first place). In the output directory large-pg/chrom-subproblems/chr0/fasta/ I can see that indeed only genomeA has a non-empty file, whereas for the other chromosomes all references have non-empty files. (Probably because those chr0 sequences couldn't be mapped to the minigraph graph?)
In conclusion, do you think I will solve the issue by removing the chromosome 0 from all input files? And is my hypothesis correct that the pipeline takes only those references that are present in all individual chromosome alignments? Personally, I would have expected the pipeline to still include all paths even if it cannot match a chromosome for all references.
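The check I did by eye on the sed commands could be scripted roughly like this. It is a small Python sketch, assuming you have collected one header line per chromosome (for example from the log) and that the fields are tab-separated with space-separated sample names in the RS tag. The function names and genomeB are hypothetical.

```python
def rs_samples(header_line):
    """Return the sample names listed in the RS tag of a GFA header line."""
    for field in header_line.rstrip("\r\n").split("\t")[1:]:
        if field.startswith("RS:Z:"):
            return field[len("RS:Z:"):].split()
    return []


def missing_references(headers_by_chrom, expected):
    """Map each chromosome to the expected references absent from its header."""
    report = {}
    for chrom, header in headers_by_chrom.items():
        present = set(rs_samples(header))
        missing = [sample for sample in expected if sample not in present]
        if missing:
            report[chrom] = missing
    return report


# Hypothetical headers mirroring what I saw: chr0 lists only genomeA.
headers = {
    "chr0": "H\tVN:Z:1.1\tRS:Z:genomeA",
    "chr1": "H\tVN:Z:1.1\tRS:Z:genomeA genomeB",
}
print(missing_references(headers, ["genomeA", "genomeB"]))
```

Any chromosome that shows up in the report is one whose header dropped a reference, which is what seems to have happened for my chr0.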
Yeah, I think you've figured it out (thanks!!). I am able to reproduce it here on the yeast data by manually dropping a sample for the first chromosome. It doesn't get put in the GFA header for that chromosome even if it's specified with --reference, and the final merged GFA inherits this header, leading to the issue you described.
Dropping your chromosome-0 sounds like it would indeed be a work-around. As would hand-editing your GFA to fix the header, then regenerating the .gbz.
This definitely looks like a bug. I don't see why it doesn't just put every reference in every header whether or not it's used -- so I will update it to do this in the next release.
Aha, thanks for the temporary workarounds! And looking forward to the next release!
Hi, I have been trying out the Minigraph-Cactus pangenome pipeline with the provided yeast data set, with subsequent mapping of short reads to it. vg surject against any genome specified as "reference" works perfectly! However, when I use the exact same commands with a larger dataset (6 chromosome-level assemblies of ~2.5 Gbp each), I can only run vg surject against the first specified "reference". I looked at the ${prefix}.d2.gbz files for both using vg paths -MHx ${gbz} and I noticed that for the yeast pangenome all genomes are indeed available as "reference", but that for the large pangenome only the first assembly is available as "reference".
This is how I ran the MC pangenome pipeline:
Btw, I masked out the names of the large pangenome because the data is not (yet) publicly available. If you have a dataset of the same size that I should try to confirm, I'm happy to run it.
This is the output of vg paths:
From what I understand, only paths designated as "REFERENCE" in the second column can be used for vg surject? At least, that is my experience. This is how I would run vg surject for completeness' sake:
However, this would never work for the large pangenome because the other genomes specified with --reference are not included as "reference" paths in the GBZ file (only genomeA is):
Is this because of the large size of my pangenome? If so, is there a specific flag I should set when working with pangenomes larger than yeast to make sure my other genomes are included as "reference" paths in the resulting graph?
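To make the REFERENCE check less manual, the vg paths table could be summarised with a short Python sketch like the one below. It assumes the output is tab-separated, with the path name in the first column and the sense in the second (as described above), and that header lines start with "#". The function name and the example path names are hypothetical.

```python
from collections import defaultdict


def paths_by_sense(table_text):
    """Group path names by the sense given in the second column."""
    by_sense = defaultdict(list)
    for line in table_text.splitlines():
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 2:
            continue
        by_sense[cols[1]].append(cols[0])
    return dict(by_sense)


# Hypothetical table with one REFERENCE path and one HAPLOTYPE path.
example = (
    "#NAME\tSENSE\n"
    "genomeA#0#chr1\tREFERENCE\n"
    "genomeB#1#chr1#0\tHAPLOTYPE\n"
)
print(paths_by_sense(example).get("REFERENCE", []))
```

Feeding it the saved output of vg paths -MHx ${gbz} and looking at the "REFERENCE" entry would show which genomes vg surject can actually target.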