EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0

Format of assembly report isn't clear #212

Open CholoTook opened 3 years ago

CholoTook commented 3 years ago

Which columns of the assembly report are used by the assembly checker to define synonyms?

Enquiring minds demand to know! ;-)

Many thanks, Dan.

tcezard commented 3 years ago

It is indeed not the clearest part of the code and is pretty much absent from the documentation. The assembly report is expected to have 10 columns, and the checker records the content of columns 1, 5, 7, and 10.

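Assembly reports that match this description can be found on the GenBank FTP site, like this one: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/285/GCA_000002285.2_CanFam3.1/GCA_000002285.2_CanFam3.1_assembly_report.txt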

If the first column (CHROM) of the VCF and the first word (anything before the first whitespace) of the FASTA header each match one of the synonyms found in the columns mentioned above from the assembly report, then they are matched.
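
To make that concrete, here is a rough Python sketch of the matching rule as described above. It is not the checker's actual code (which is C++), only an illustration. It assumes the report is tab-separated with `#` comment lines, that any two names on the same row are interchangeable, and that `na` placeholders should be skipped; the function names and the sample row are made up for the example.

```python
import io

# Columns 1, 5, 7 and 10 of the assembly report, as zero-based indexes.
SYNONYM_COLUMNS = (0, 4, 6, 9)


def load_synonyms(report_lines):
    """Map each name to the set of names sharing a row with it."""
    synonyms = {}
    for line in report_lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError("expected 10 columns, got %d" % len(fields))
        # Assumption: "na" marks a missing value and is not a synonym.
        names = {fields[i] for i in SYNONYM_COLUMNS if fields[i] != "na"}
        for name in names:
            synonyms.setdefault(name, set()).update(names)
    return synonyms


def is_match(chrom, fasta_header, synonyms):
    """Compare a VCF CHROM value with the first word of a FASTA header."""
    words = fasta_header.lstrip(">").split()
    if not words:
        return False
    first_word = words[0]
    return chrom == first_word or first_word in synonyms.get(chrom, set())


# Made-up report row, for illustration only.
report = io.StringIO(
    "# Sequence-Name\t...\n"
    "1\tassembled-molecule\t1\tChromosome\tGB0001.1\t=\tRS0001.1\t"
    "Primary Assembly\t1000\tchr1\n"
)
synonyms = load_synonyms(report)
print(is_match("chr1", ">GB0001.1 made-up description", synonyms))  # True
```

Under these assumptions the lookup is symmetric: it does not matter which of the four columns holds the CHROM value and which holds the FASTA name.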

I hope this helps.

CholoTook commented 3 years ago

Cool, so (just to check I understand) if CHROM is in column 5 and 'the first word' of the fasta header is in column 1, or the other way round, for example, either would be a match?

Adding your text to the documentation would be enough I think.

I'm playing with an assembly mapping where the chromosome was initially called 1, 2, 3, etc., then got renamed to chr1, chr2, chr3, etc. It could be nice to add a 'chr stripped' (or 'chr prepended') ID to the list of synonyms.
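
Something like this Python sketch is what I have in mind. It is a hypothetical helper, not an existing vcf-validator feature, and `chr_variants` is a name invented here.

```python
def chr_variants(name):
    """Return the name plus its 'chr stripped' or 'chr prepended' form."""
    variants = {name}
    if name.startswith("chr"):
        stripped = name[len("chr"):]
        if stripped:  # do not add an empty name for a bare "chr"
            variants.add(stripped)
    else:
        variants.add("chr" + name)
    return variants


print(sorted(chr_variants("chr1")))  # ['1', 'chr1']
print(sorted(chr_variants("2")))     # ['2', 'chr2']
```

Each synonym read from the assembly report could be expanded this way before it is stored, so that `1` and `chr1` end up in the same group.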

BTW, since you're here ;-) Does the vcf_assembly_checker look for matching sequence lengths to 'validate' the assembly report?

Also, I initially thought I should make the sequence.fna.fai using makeblastdb, but then realised it was the samtools format fasta index... Why do you build and then discard the fasta index? You mention it's required and then silently create it (and then discard it) on the fly... I was wondering why the tool was running so slow until I realised that makeblastdb wasn't producing the .fai.
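
For anyone else who trips over this: the samtools-style `.fai` is a five-column, tab-separated text file (name, length, byte offset of the first base, bases per line, bytes per line). The Python sketch below shows roughly what building one involves. It is only an illustration, not how vcf-validator or samtools implement it, and it does not check that every line within a sequence has the same length, which a real indexer requires.

```python
import io


def index_fasta(handle):
    """Build faidx-style entries from a FASTA stream opened in binary mode."""
    entries = []
    current = None  # [name, length, offset, linebases, linewidth]
    position = 0    # byte offset of the start of the current line
    for raw in handle:
        if raw.startswith(b">"):
            if current is not None:
                entries.append(tuple(current))
            name = raw[1:].split()[0].decode()
            # The sequence starts right after the header line.
            current = [name, 0, position + len(raw), 0, 0]
        elif current is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if current[3] == 0 and bases:
                current[3] = bases     # bases on the first sequence line
                current[4] = len(raw)  # bytes on that line, newline included
            current[1] += bases
        position += len(raw)
    if current is not None:
        entries.append(tuple(current))
    return entries


fasta = io.BytesIO(b">seq1 made-up example\nACGT\nAC\n")
for entry in index_fasta(fasta):
    print("\t".join(str(value) for value in entry))
```

In practice `samtools faidx sequence.fna` writes `sequence.fna.fai` next to the FASTA file, which is the simplest way to get the index.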

Many thanks, Dan.

On Thu, 24 Jun 2021 at 12:53, Timothee Cezard @.***> wrote:

It is indeed not the clearest part of the code and pretty much absent from the documentation: The assembly report is expected to have 10 columns https://github.com/EBIvariation/vcf-validator/blob/78cadd491d1d1e25fb5e8538072ba86c7272db2e/inc/assembly_report/assembly_report.hpp#L112 and it is recording the content of column 1, 5, 7, and 10 https://github.com/EBIvariation/vcf-validator/blob/78cadd491d1d1e25fb5e8538072ba86c7272db2e/inc/assembly_report/assembly_report.hpp#L181

The assembly report that match this description can be found on Genbank FTP like this one https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/285/GCA_000002285.2_CanFam3.1/GCA_000002285.2_CanFam3.1_assembly_report.txt

If the first column (CHROM) of the VCF and the first word (anything before the first white space) of the fasta header contains any of the synonyms found in the columns mentioned above from the assembly report then they are matched.

I hope this helps.
