epi2me-labs / wf-human-variation

Better hg38 genome? #105

Closed ymcki closed 8 months ago

ymcki commented 8 months ago

Ask away!

I found that some bioinformatics tools use this hg38 genome: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/latest_release/GRCh38.primary_assembly.genome.fa.gz. It is essentially the same as the genome recommended in https://github.com/epi2me-labs/wf-human-variation/issues/36 (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz), except for these differences:

  1. It is missing the EBV sequence (which can easily be added if we want).
  2. It doesn't contain any IUPAC ambiguity nucleotides at all. (I presume that means it has been better polished with newer technology?)
  3. The naming of the alternative chromosomes is not the same (again, this can be fixed).
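
The differences listed above can be checked directly on the two files rather than taken on trust. Below is a minimal sketch, assuming both FASTA files have been downloaded and decompressed locally. The function names `summarize_fasta` and `compare_contigs` are hypothetical helpers written for this illustration and are not part of the workflow. It collects contig names and counts IUPAC ambiguity codes other than N for each contig.

```python
from collections import Counter

# IUPAC ambiguity codes other than N (N is expected in assembly gaps).
IUPAC_AMBIGUOUS = "RYSWKMBDHV"


def summarize_fasta(lines):
    """Map each contig name to a Counter of its non-N ambiguity codes.

    `lines` is any iterable of FASTA text lines, e.g. an open file handle.
    """
    summary = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The contig name is the first whitespace-delimited header token.
            current = line[1:].split()[0]
            summary[current] = Counter()
        elif current is not None:
            seq = line.upper()  # soft-masked (lowercase) bases count too
            for code in IUPAC_AMBIGUOUS:
                n = seq.count(code)
                if n:
                    summary[current][code] += n
    return summary


def compare_contigs(summary_a, summary_b):
    """Return (names only in A, names only in B), each sorted."""
    names_a, names_b = set(summary_a), set(summary_b)
    return sorted(names_a - names_b), sorted(names_b - names_a)
```

For example, calling `summarize_fasta(open(path))` on each file and passing the two results to `compare_contigs` would show whether a contig such as `chrEBV` is present in only one of them, and which contig names differ.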

Would adding EBV to GRCh38.primary_assembly.genome.fa make it a better genome than GCA_000001405.15_GRCh38_no_alt_analysis_set.fna because of point 2?

Would it cause any potential problems with the current pipeline?
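
For reference, the kind of modification I have in mind (renaming contigs and appending an extra sequence such as EBV) could be sketched as follows. This is only an illustration under stated assumptions. `rename_and_append` is a hypothetical helper, the rename mapping must be supplied by the user, and the EBV record is assumed to be available as a separate FASTA file. Header descriptions after the contig name are dropped in this sketch.

```python
def rename_and_append(primary_lines, extra_lines, rename=None):
    """Yield FASTA lines: primary records (headers renamed), then extra records.

    `rename` maps old contig names to new ones; unmapped names are kept.
    Raises ValueError if two records end up with the same name.
    """
    rename = rename or {}
    seen = set()
    for source in (primary_lines, extra_lines):
        for line in source:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # Keep only the contig name; any description text is dropped.
                name = line[1:].split()[0]
                name = rename.get(name, name)
                if name in seen:
                    raise ValueError(f"duplicate contig name: {name}")
                seen.add(name)
                yield f">{name}"
            elif line:
                # Sequence lines are passed through unchanged.
                yield line
```

The yielded lines can be written to a new file with a trailing newline each. The resulting FASTA would need to be re-indexed (for example with `samtools faidx`) before use, and whether it behaves well in the pipeline would still have to be tested.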

SamStudio8 commented 8 months ago

@ymcki Our recommendation of GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz is based on this blog post, which outlines potential pitfalls with the various flavours of hg38; you might find it useful. I would hesitate to comment on what you propose to do with your reference without any benchmarking data!