Open HLHsieh opened 3 months ago
Hi Hsin,

Thank you for your interest in HMMSTR! Here are my responses to your questions:

1) The only difference in output files between the coordinates and targets_tsv modes is that coordinates mode also produces the *_inputs.tsv file from the input coordinates. All other outputs should be equivalent.

2) The input tsv can contain any target sequence from any species you would like! The benefit of this mode is that it is 100% reference free, so hg38 or any other sequence you would like to target should be compatible.

3) HMMSTR should be compatible with any tandemly repeated DNA element. However, if the sequence diverges too far from the expected repeat unit, HMMSTR may report that no reads were found with that underlying repeat sequence.

As for your error, this appears to be a bug where I failed to initialize the 'num_supporting_reads' column when the first target returned had no supporting reads detected. I have patched the place where I believe the problem was introduced, but please let me know if you are still encountering this issue.

4) This was an inconsistency between the allele sorting in the genotypes and read_assignment files. I have now updated the outputs so that the allele assignment number is the same across both files. Note that the alleles are sorted by size, so all homozygous calls will have A1 equal to 0.

As for the rows in read_assignments with no cluster assignment: these reads were called as outliers and dropped before the clustering step, so they were never assigned to a cluster. If you would like all reads to be clustered, the 'auto' peakcalling_method can be overridden to 'gmm' or 'kde' to prevent 'kde_throw_outliers', which automatically discards reads with copy numbers that lie outside of the IQR of the data.
Hope this helps and please let me know if you have any other questions, Kinsey
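To make the outlier step in point 4 more concrete, here is a minimal sketch of IQR-based filtering on per-read copy numbers. It is not HMMSTR's actual code: the function name `split_outliers`, the fence multiplier `k=1.5` (the conventional Tukey value) and the toy copy numbers are all assumptions made for illustration.

```python
from statistics import quantiles


def split_outliers(copy_numbers, k=1.5):
    """Split copy numbers into (kept, dropped) using IQR fences.

    k=1.5 is the conventional Tukey multiplier and is an assumption here;
    the thread only says reads "outside of the IQR" are discarded.
    """
    # Quartiles of the observed copy numbers (the median is not needed).
    q1, _, q3 = quantiles(copy_numbers, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr

    kept = [c for c in copy_numbers if low <= c <= high]
    dropped = [c for c in copy_numbers if not (low <= c <= high)]
    return kept, dropped


# Toy data (hypothetical): six reads near 264 copies and one far away.
reads = [262.0, 263.0, 264.0, 264.0, 265.0, 266.0, 410.0]
kept, dropped = split_outliers(reads)
print("kept for clustering:", kept)
print("dropped as outliers:", dropped)
```

In a scheme like this, the dropped reads never reach the clustering step, which would be consistent with them showing an empty cluster assignment in the output.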
Hi Kinsey,
Thank you for your quick response to my issue; everything was fixed. I tried HMMSTR on several of my own datasets, and the experience was amazing.
However, I have several questions:
I am wondering why the copy number of allele 1 is 0 while that of allele 2 is 264.0, rather than having only allele 1 with a copy number of 264.0.
read_assignments.tsv shows:
However, all the reads were assigned to allele one. Additionally, I noticed that some reads have a blank in the cluster_assignments field. What does this mean, and how should I interpret these reads?
Could you also provide a brief explanation of the "freq" column (the frequency of the copy number for the assigned target)? I expected this value to be less than 1, but in my results the values are greater than 1.
Any suggestions and comments would be appreciated.
Best regards, Hsin
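For anyone who wants to inspect the reads with a blank cluster assignment, the sketch below separates them from the assigned reads using only the Python standard library. The column name `cluster_assignments` is taken from this thread; the other columns (`read_id`, `counts`), the inline sample rows and the helper name `split_by_assignment` are hypothetical, so adjust them to match your actual read_assignments.tsv.

```python
import csv
import io

# Hypothetical stand-in for read_assignments.tsv; only the
# 'cluster_assignments' column name comes from the thread.
SAMPLE = (
    "read_id\tcounts\tcluster_assignments\n"
    "r1\t264.0\t0\n"
    "r2\t263.0\t0\n"
    "r3\t410.0\t\n"
)


def split_by_assignment(handle, column="cluster_assignments"):
    """Return (assigned, unassigned) rows from a tab-separated file handle."""
    assigned, unassigned = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        value = (row.get(column) or "").strip()
        # A blank field means the read was not placed in any cluster.
        (assigned if value else unassigned).append(row)
    return assigned, unassigned


assigned, unassigned = split_by_assignment(io.StringIO(SAMPLE))
print("assigned reads:", [r["read_id"] for r in assigned])
print("unassigned reads:", [r["read_id"] for r in unassigned])
```

To run it on a real file, pass `open("read_assignments.tsv", newline="")` instead of the `io.StringIO` sample.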