genepi / imputationserver

Michigan Imputation Server: A new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity
https://imputationserver.sph.umich.edu/
GNU Affero General Public License v3.0

Input genotypes are expect to come from array genotypes with no more than 20000 SNPs expected per chunk. #70

Open · bk11-lab opened this issue 2 years ago

bk11-lab commented 2 years ago

Hi, I'm trying to impute HLA in TOPMed-imputed data using the Michigan Imputation Server, but the job fails, stating "Calculating QC Statistics failed". Also, when I try to impute Michigan Imputation Server output using the TOPMed Imputation Server, the job fails, stating "Input genotypes are expect to come from array genotypes with no more than 20000 SNPs expected per chunk". In both cases I have specified the correct genome build. Could you please let me know what might be causing this? I appreciate your help. Thanks in advance.

BK

ghost commented 2 years ago

Hi, I am also getting the same error. Were you able to figure this out?

merveatik commented 1 year ago

Hi,

I am experiencing the same error. Did you find an answer to it? Could you please help me with that?

Thank you so much.

Sammypacoo commented 1 year ago

I asked ChatGPT: The error message you received indicates that the number of SNPs in your uploaded data and their distribution across chunks exceed the expected limit of 20,000 SNPs per chunk. This might be due to several reasons:

Data Source: If your data comes from a high-density genotyping array or sequencing, it's possible that you have a large number of SNPs in your dataset. This could result in exceeding the chunk size limit.

Data Format: The format of your input data might not match the format expected by the TOPMed Imputation Server. Different file formats or variations in headers could lead to discrepancies in SNP counts.

Data Preprocessing: If you've merged or combined datasets, the resulting dataset might have more SNPs than anticipated. Data preprocessing steps, such as merging or filtering, can impact the final SNP count.

Version of Reference Panel: Different versions of reference panels might contain varying numbers of SNPs. If you're using an older version of the reference panel, it might have fewer SNPs than your dataset, causing the SNP count to exceed the chunk limit.

Duplication or Overlapping: If there's overlap or duplication in your data, the same SNP might appear in multiple chunks, leading to inflated SNP counts.

Misalignment: If the reference panel you're using is not well-aligned with your dataset, it could lead to a mismatch in SNP positions and thus incorrect chunking.

Data Integrity: Ensure that your data is free from corruption and errors, as these issues could impact the accurate counting of SNPs.

To address this issue, carefully review your data preprocessing steps, data sources, and the reference panel you're using. Consider checking the data format and metadata requirements of the TOPMed Imputation Server to make sure your data meets their expectations. If the issue persists, reaching out to the support resources of the TOPMed Imputation Server could provide you with specific guidance tailored to your situation.
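Before re-uploading, it can help to verify whether your file actually exceeds the per-chunk limit. Below is a minimal diagnostic sketch in Python; the 20 Mb window size is an assumption (only the 20,000-variant limit appears in the error message), and `input.vcf.gz` is a hypothetical file name:

```python
#!/usr/bin/env python3
"""Count variants per genomic window in a VCF and flag windows that
exceed the per-chunk limit quoted in the server's error message."""
import gzip
import sys
from collections import Counter

CHUNK_SIZE = 20_000_000  # assumed chunk width in bp (not stated in the error)
LIMIT = 20_000           # per-chunk variant limit from the error message

def count_chunks(vcf_path):
    """Return a Counter mapping (chrom, window_index) -> variant count."""
    counts = Counter()
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip VCF header lines
            chrom, pos = line.split("\t", 2)[:2]
            counts[(chrom, int(pos) // CHUNK_SIZE)] += 1
    return counts

if __name__ == "__main__":
    # usage: python chunk_counts.py input.vcf.gz
    for (chrom, idx), n in sorted(count_chunks(sys.argv[1]).items()):
        start = idx * CHUNK_SIZE + 1
        flag = "  <-- exceeds limit" if n > LIMIT else ""
        print(f"{chrom}:{start}-{start + CHUNK_SIZE - 1}\t{n}{flag}")
```

Array data should stay well under the limit, while a WGS-derived or already-imputed VCF will typically flag nearly every window, which matches the error reported above.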

Could it be because the data came from WGS?

ChatGPT: Yes, that's a possibility. Whole-genome sequencing (WGS) data typically contains a very large number of genetic variants, including single nucleotide polymorphisms (SNPs), insertions, deletions, and other types of genetic variation. As a result, WGS data can have much higher density and complexity compared to genotyping array data.

If you are trying to perform imputation using WGS data on the TOPMed Imputation Server, it's quite likely that the sheer number of SNPs present in your data is causing the chunk size to exceed the server's limits. Imputing such a dense dataset can be computationally intensive and may require specialized approaches to manage the data.

Here are some steps you could consider:

Subset Your Data: Depending on your research goals, you might not need to use all the SNPs present in your WGS data for imputation. Consider filtering the variants to retain a more manageable subset that is relevant to your analysis, for example the site list of a standard genotyping array (see the sketch after this list).

Downsampling: If the TOPMed Imputation Server allows, you could downsample your data to reduce the number of SNPs and make it fit within the server's chunking limits. This may, however, sacrifice some resolution (a thinning sketch appears at the end of this comment).

Consult the Server Documentation: Carefully review the documentation and guidelines provided by the TOPMed Imputation Server. They might offer specific recommendations for handling WGS data or provide guidance on preprocessing steps to prepare your data for imputation.

Alternate Tools: If the TOPMed Imputation Server isn't suitable for processing your WGS data due to its density, you might need to explore other imputation tools or platforms that are better suited for handling high-density data.

Reach Out for Support: If you're unsure about how to proceed, consider reaching out to the support team associated with the TOPMed Imputation Server. They might be able to provide more specific guidance based on your situation.
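For the subsetting step above, a rough sketch follows. It assumes a hypothetical `array_sites.txt` file with one `chrom<TAB>pos` pair per line (for example, the site list of the genotyping array you want to mimic); a dedicated tool such as bcftools would be faster in practice, but the logic is the same:

```python
#!/usr/bin/env python3
"""Subset a WGS VCF to a fixed list of array sites before imputation.
Sketch only: the site list and file names are hypothetical."""
import gzip
import sys

def load_sites(path):
    """Read chrom<TAB>pos pairs into a set of (chrom, pos) string tuples."""
    with open(path) as fh:
        return {tuple(line.split()[:2]) for line in fh if line.strip()}

def subset(vcf_in, sites_file, vcf_out):
    sites = load_sites(sites_file)
    opener = gzip.open if vcf_in.endswith(".gz") else open
    with opener(vcf_in, "rt") as src, open(vcf_out, "w") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)        # keep all header lines
                continue
            chrom, pos = line.split("\t", 2)[:2]
            if (chrom, pos) in sites:  # keep only the known array positions
                dst.write(line)

if __name__ == "__main__":
    # usage: python subset_vcf.py input.vcf.gz array_sites.txt subset.vcf
    subset(sys.argv[1], sys.argv[2], sys.argv[3])
```

This keeps the per-chunk counts comparable to array data, which is what the error message says the server expects.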

Remember that WGS data is considerably more complex and data-intensive than genotyping array data, so adapting existing tools and methodologies for imputation may require additional considerations and modifications.
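If no suitable site list is available, random thinning is a cruder fallback for the downsampling idea above. Sketch only: the keep fraction and file names are hypothetical, and thinning discards information that targeted subsetting would keep:

```python
#!/usr/bin/env python3
"""Randomly thin a VCF to roughly a target fraction of its variants.
Sketch only: keep_fraction and file names are hypothetical."""
import gzip
import random
import sys

def thin(vcf_in, vcf_out, keep_fraction, seed=42):
    random.seed(seed)  # fixed seed keeps the thinned file reproducible
    opener = gzip.open if vcf_in.endswith(".gz") else open
    with opener(vcf_in, "rt") as src, open(vcf_out, "w") as dst:
        for line in src:
            # header lines always pass through; data lines pass at random
            if line.startswith("#") or random.random() < keep_fraction:
                dst.write(line)

if __name__ == "__main__":
    # usage: python thin_vcf.py input.vcf.gz thinned.vcf 0.1
    thin(sys.argv[1], sys.argv[2], float(sys.argv[3]))
```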