How to deal with outgroup chr2A and chr2B for chr2

jackzhong1995 commented 3 months ago

Hi, thanks for the great tool for finding archaic segments.

I'd like to know how to build the outgroup chr2.fa.gz file, since the chimpanzee has chr2A and chr2B. Can I use the Ensembl "Pan_troglodytes.Pan_tro_3.0.dna.chromosome.*.fa.gz" files?
My other question: my data use hg38 as the reference. Can I use the files you refer to (chr*.hg19.chimp.fa.gz) directly? I have downloaded them from the folder you shared (https://drive.google.com/drive/folders/115LSXmYDlitNKDO58SgxbEYlNd4EG1WK).

Best wishes.

jackzhong1995 commented 3 months ago

By the way, are there any "homo_sapiensancestor*.fa.gz" files for GRCh38 available? If not, how can I make a GRCh38 version?

yorkklause commented 3 months ago

Hi Jackzhong,

Thank you for choosing our software.

Regarding the chimp reference file, all data must be aligned to the same genomic coordinate system (GRCh37 or GRCh38). However, I don't think the "Pan_troglodytes.Pan_tro_3.0.dna.chromosome.*.fa.gz" files are aligned to the human GRCh37 or GRCh38 reference, so they are not suitable for our software.

If you intend to use human data in GRCh38, I recommend using the data from UCSC (https://hgdownload.soe.ucsc.edu/goldenPath/hg38/vsPanTro6/) and then converting the axt files to FASTA (https://main.genome-browser.bx.psu.edu/goldenPath/help/axt.html).
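
For illustration, here is a minimal sketch of that axt-to-FASTA step in Python. It is not the ArchaicSeeker2.0 authors' script; the function name `axt_to_outgroup` and the toy input are hypothetical, and the chromosome length is assumed to be supplied by the caller. It relies only on the documented axt layout (a header line, then the primary sequence, then the aligning sequence, with 1-based primary coordinates). Because chimp bases are written at human coordinates, blocks from chr2A and chr2B both end up on human chr2.

```python
from typing import Iterable


def axt_to_outgroup(lines: Iterable[str], chrom: str, length: int) -> str:
    """Project aligned (chimp) bases onto primary (human) coordinates.

    `length` is the human chromosome length; it is an assumption of this
    sketch that the caller provides it.
    """
    seq = ["N"] * length  # positions without an alignment stay 'N'
    it = (ln.rstrip("\n") for ln in lines)
    for line in it:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank block separators and comment lines
        fields = line.split()
        primary_chrom, primary_start = fields[1], int(fields[2])
        primary = next(it)  # human sequence, may contain '-'
        aligned = next(it)  # chimp sequence, may contain '-'
        if primary_chrom != chrom:
            continue
        pos = primary_start - 1  # axt is 1-based; convert to a 0-based index
        for p, a in zip(primary, aligned):
            if p == "-":
                continue  # chimp insertion: no human coordinate to write to
            if a != "-" and pos < length:
                seq[pos] = a.upper()
            pos += 1  # chimp deletions ('-') leave 'N' at this position
    return "".join(seq)


# Toy input: two blocks on human chr2, one from chimp chr2A, one from chr2B.
demo = [
    "0 chr2 3 7 chr2A 101 105 + 500",
    "AC-GTA",
    "ATTG-A",
    "",
    "1 chr2 9 10 chr2B 50 51 + 300",
    "GG",
    "GC",
]
print(axt_to_outgroup(demo, "chr2", 12))  # NNATGNANGCNN
```

For real data you would read each gzipped axt file with `gzip.open(path, "rt")`, take the chromosome lengths from the human reference, and write the result out as a FASTA record.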

As for the second question, we were unable to find an ancestor file in FASTA format for GRCh38. Even if you manage to find one, you will still need Neanderthal and Denisovan files in GRCh38 coordinates.

Our suggestion is to use liftover software to convert your data from GRCh38 to GRCh37.

My best

Kai Yuan

jackzhong1995 commented 3 months ago

Thanks for your quick reply and your helpful suggestion!

However, my data are too large to convert easily. I now have all the files prepared in GRCh38 except "homo_sapiensancestor*.fa.gz". Could you please tell me how you created these files, or where I can find the method for making them (for example, in which paper)?

Best wishes.

yorkklause commented 3 months ago

Hi Jackzhong,

We obtained the ancestral state FASTA files from the 1000 Genomes Project (low-coverage b37 version). You can find them at the following FTP link: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/retired_reference/ancestral_alignments/.

Their README file and paper describe the method used to obtain the FASTA files. I'm not sure how complex this process is. If obtaining the files proves difficult, an alternative is to use the chimp FASTA as the ancestor. The ancestor FASTA file is used solely to determine the ancestral and derived states.
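
To make the ancestral/derived idea concrete, here is a small Python sketch. It is an illustration only and not how ArchaicSeeker2.0 does it internally; the function name `derived_allele` is hypothetical, and the sketch assumes a biallelic SNP and ignores the case of the ancestral base. The ancestral base would be read from the ancestor (or chimp) FASTA at the SNP position.

```python
from typing import Optional


def derived_allele(ancestral_base: str, ref: str, alt: str) -> Optional[str]:
    """Return the derived allele of a biallelic SNP, or None if undetermined.

    Assumption of this sketch: the allele matching the ancestral base is
    ancestral, and the other allele is derived.
    """
    anc = ancestral_base.upper()
    ref, alt = ref.upper(), alt.upper()
    if anc == ref:
        return alt
    if anc == alt:
        return ref
    return None  # ancestral base is missing (e.g. 'N') or matches neither allele


print(derived_allele("A", "A", "G"))  # G
print(derived_allele("g", "A", "G"))  # A
print(derived_allele("N", "A", "G"))  # None
```

Sites where the function returns None cannot be polarized with that ancestor file and would have to be skipped or handled separately.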

My best

Kai Yuan

Shuhua-Group / ArchaicSeeker2.0

How to deal with outgroup chr2A and chr2B for chr2 #8