Open CholoTook opened 3 years ago
It is indeed not the clearest part of the code, and it is pretty much absent from the documentation. The assembly report is expected to have 10 columns, and the checker records the content of columns 1, 5, 7, and 10.
Assembly reports that match this description can be found on the GenBank FTP site, like this one: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/285/GCA_000002285.2_CanFam3.1/GCA_000002285.2_CanFam3.1_assembly_report.txt
If the first column (CHROM) of the VCF and the first word (anything before the first white space) of the fasta header are each found among the synonyms taken from the columns mentioned above, then they are matched.
I hope this helps.
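To make the description above concrete, here is a rough Python sketch of the matching rule, not the validator's actual C++ code. It assumes a tab-separated report with `#` comment lines, that `na` marks an empty cell, and that two names only match when they come from the same report line; the function names and the example row are made up for illustration.

```python
# Columns 1, 5, 7 and 10 of the assembly report, as zero-based indexes.
SYNONYM_COLUMNS = (0, 4, 6, 9)

# Illustrative row in the 10-column layout; values are for demonstration only.
EXAMPLE_REPORT = (
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "chr1\tassembled-molecule\t1\tChromosome\tCM000001.3\t=\t"
    "NC_006583.3\tPrimary Assembly\t122678785\tchr1\n"
)


def load_synonyms(report_text):
    """Map each synonym to the set of synonyms found on its report line."""
    lookup = {}
    for line in report_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError("expected 10 columns, got %d" % len(fields))
        # Assumption: 'na' means "no value" and is not a usable synonym.
        names = {fields[i] for i in SYNONYM_COLUMNS if fields[i] != "na"}
        for name in names:
            lookup[name] = names
    return lookup


def fasta_first_word(header):
    """Return everything before the first white space, without the '>'."""
    parts = header.lstrip(">").split()
    return parts[0] if parts else ""


def is_match(chrom, fasta_header, lookup):
    """True if CHROM and the fasta contig name share a report line."""
    return fasta_first_word(fasta_header) in lookup.get(chrom, ())
```

Under these assumptions the direction does not matter: `is_match("chr1", ">CM000001.3 some description", load_synonyms(EXAMPLE_REPORT))` and the reverse pairing both succeed, because both names sit on the same line of the report.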
Cool, so (just to check I understand) if CHROM is in column 5 and 'the first word' of the fasta header is in column 1, or the other way round, for example, either would be a match?
Adding your text to the documentation would be enough I think.
I'm playing with an assembly mapping where the chromosome was initially called 1, 2, 3, etc., then got renamed to chr1, chr2, chr3, etc. It could be nice to add a 'chr stripped' (or 'chr prepended') ID to the list of synonyms.
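As a sketch of what that suggestion could look like (purely hypothetical, nothing like this is confirmed to exist in vcf-validator), a helper could add the 'chr stripped' or 'chr prepended' form of each name to a synonym set. The function names here are invented for illustration.

```python
def chr_variants(name):
    """Return the name plus its 'chr'-stripped or 'chr'-prepended form."""
    if name.startswith("chr") and len(name) > 3:
        return {name, name[3:]}      # chr1 -> {chr1, 1}
    return {name, "chr" + name}      # 1 -> {1, chr1}


def expand_synonyms(names):
    """Hypothetical: widen a set of synonyms with their chr variants."""
    expanded = set()
    for name in names:
        expanded |= chr_variants(name)
    return expanded
```

Applying this blindly to accessions would also create names like chrCM000001.3, so a real implementation would probably want to restrict it to the sequence-name column.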
BTW, since you're here ;-) Does the vcf_assembly_checker look for matching sequence lengths to 'validate' the assembly report?
Also, I initially thought I should make the sequence.fna.fai using makeblastdb, but then realised it was the samtools format fasta index...
Why do you build and then discard the fasta index? You mention it's
required and then silently create it (and then discard it) on the fly... I
was wondering why the tool was running so slow until I realised that
makeblastdb wasn't producing the .fai.
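For anyone else who trips over this: the .fai is the index that `samtools faidx sequence.fna` writes next to the fasta file, with one tab-separated line per sequence giving name, length, byte offset of the first base, bases per line, and bytes per line. The Python below is only a rough sketch of how those five fields are derived, assuming a well-formed fasta with consistent line lengths; it is not what samtools or the validator actually run, and `build_fai` is a made-up name.

```python
def build_fai(fasta_path):
    """Return .fai-style records: (name, length, offset, linebases, linewidth)."""
    records = []
    current = None  # mutable record for the sequence being read
    offset = 0      # byte position of the start of the current line
    with open(fasta_path, "rb") as handle:
        for raw in handle:
            if raw.startswith(b">"):
                if current is not None:
                    records.append(tuple(current))
                # Name is the first word of the header, without the '>'.
                name = raw[1:].split()[0].decode()
                # The sequence starts right after the header line.
                current = [name, 0, offset + len(raw), 0, 0]
            elif current is not None:
                bases = len(raw.rstrip(b"\r\n"))
                if current[3] == 0 and bases > 0:
                    # First sequence line sets bases/bytes per line.
                    current[3], current[4] = bases, len(raw)
                current[1] += bases
            offset += len(raw)
    if current is not None:
        records.append(tuple(current))
    return records
```

In practice it is simpler to run samtools once up front and keep the resulting file beside the fasta.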
Many thanks, Dan.
On Thu, 24 Jun 2021 at 12:53, Timothee Cezard @.***> wrote:
It is indeed not the clearest part of the code and pretty much absent from the documentation: The assembly report is expected to have 10 columns https://github.com/EBIvariation/vcf-validator/blob/78cadd491d1d1e25fb5e8538072ba86c7272db2e/inc/assembly_report/assembly_report.hpp#L112 and it is recording the content of column 1, 5, 7, and 10 https://github.com/EBIvariation/vcf-validator/blob/78cadd491d1d1e25fb5e8538072ba86c7272db2e/inc/assembly_report/assembly_report.hpp#L181
The assembly report that match this description can be found on Genbank FTP like this one https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/285/GCA_000002285.2_CanFam3.1/GCA_000002285.2_CanFam3.1_assembly_report.txt
If the first column (CHROM) of the VCF and the first word (anything before the first white space) of the fasta header contains any of the synonyms found in the columns mentioned above from the assembly report then they are matched.
I hope this helps.
Which columns of the assembly report are used by the assembly checker to define synonyms?
Enquiring minds demand to know! ;-)
Many thanks, Dan.