altemose / NTRprism


NTRprism #2

Open duhuipeng opened 2 years ago

duhuipeng commented 2 years ago

Dear author, I applied your script from GitHub. We have already identified the centromere monomer of our species, which is about 336 bp (see below), using the parameters 1 1000 10 6 0. [image] I am now trying to find the size of its HOR unit. When I applied the parameters from your GitHub page (1 5000 30 6), the main peak still appeared at 336 bp, and there was no HOR peak (see below). [image]

However, an article (published in 2000) reports that the HOR is about 3.3-3.4 kb, and we now want to see the result for our own centromere. How should I modify the parameters so that the peak of the HOR unit is displayed instead of the smallest repeating unit?

Best

altemose commented 2 years ago

I would first try 1 5000 10 7 0. Increasing the k-mer length can improve sensitivity for longer repeating units.

You might not be picking up the HOR for several reasons. For example, in our simulations we found that high sequence divergence after the formation of HORs can obscure the HOR peak and favor the monomer peak. Furthermore, how consistent is the HOR length across the array? An array with inconsistent lengths would struggle to show a strong HOR peak. It would help to take one of your sequences and make a dotplot with large word size, to see if a strong HOR signal is present (as off-diagonal parallel lines). If you send me your input sequence, I can try a few other tweaks with the latest, unreleased version of NTRprism to see if anything pops out.
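As a quick numerical stand-in for the dotplot check described above, one could count exact word matches per diagonal offset. Off-diagonal parallel lines in a dotplot correspond to offsets with many matches. This is only a minimal sketch and is not part of NTRprism: the function name `diagonal_match_counts` and its parameters `word_size` and `max_offset` are hypothetical, and the sequence is assumed to be a plain uppercase string.

```python
from collections import Counter, defaultdict


def diagonal_match_counts(seq, word_size, max_offset):
    """Count exact word matches on each off-diagonal of a self-dotplot.

    The offset is j - i for two start positions i < j that carry the same
    word. Hypothetical helper for illustration; not NTRprism code.
    """
    # Index every word by its (ascending) start positions.
    positions = defaultdict(list)
    for i in range(len(seq) - word_size + 1):
        positions[seq[i:i + word_size]].append(i)

    diagonals = Counter()
    for pos in positions.values():
        for idx, a in enumerate(pos):
            for b in pos[idx + 1:]:
                offset = b - a
                if offset > max_offset:
                    break  # positions are sorted, so later ones are farther
                diagonals[offset] += 1
    return diagonals
```

With a word size large enough to span the differences between neighbouring monomers, an array with a consistent HOR should show far more matches at the HOR length (and its multiples) than at the monomer length. Note that for highly repetitive words the inner loop is quadratic in the number of occurrences within `max_offset`, so this is only suitable for inspecting a single array, not a whole genome.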

duhuipeng commented 2 years ago

Dear author, [image] how can I see the number of times each peak appears? I want to count their distribution, as shown here: [image] I can see the density in the generated file, but where can I find the total counts? With those I could work out the number of occurrences per peak.

Best HuipengDu

altemose commented 2 years ago

As I understand your question, you would like to decompose each array into non-overlapping higher-order repeat units and then report the absolute total sequence length contained in higher-order repeat units of each apparent size. NTRprism does not perform alignment-based string decomposition by itself, which would give you a more precise answer to your query (perhaps try StringDecomposer for this once you have a set of monomer classes). However, NTRprism can give you an approximate answer based on the string splitting that it performs.

For a bin size of 1, each row in the output matrix file contains the integer count for all fragments of length 1 to span that are produced when splitting the input string at all instances of the k-mer specified in column 1. However, each row represents an independent string splitting operation using a different k-mer, meaning that fragments produced in different rows may overlap and cannot be counted independently for the purposes of string decomposition. As a consequence, the peak heights in the column sum spectrum cannot be interpreted as being strictly proportional to the fraction of the array belonging to non-overlapping repeats of length L.

So to get an approximate answer to your question, you can search for the single k-mer that maximizes the count at a particular fragment length, then examine the absolute distribution of fragment counts for that k-mer. The newest version of NTRprism (still under testing) identifies this optimal k-mer and produces a fasta file of all sequence fragments within a certain length range produced by splitting the sequence at that k-mer.
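To make the "best single k-mer" idea concrete, here is a minimal sketch of the approach. It is an illustration under stated assumptions, not NTRprism's own code and not the unreleased version mentioned above. It assumes the sequence is a plain uppercase string and treats a fragment length as the distance between consecutive occurrences of the same k-mer. The names `kmer_positions`, `fragment_length_counts`, and `best_kmer_for_length` are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_positions(seq, k):
    """Map each k-mer to the ascending list of its start positions in seq."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def fragment_length_counts(positions, span):
    """Count fragment lengths for one k-mer.

    A fragment length is taken here to be the distance between two
    consecutive occurrences of the k-mer; lengths above span are ignored.
    """
    counts = Counter()
    for a, b in zip(positions, positions[1:]):
        length = b - a
        if length <= span:
            counts[length] += 1
    return counts


def best_kmer_for_length(seq, k, target_len, span):
    """Return the k-mer with the most fragments of target_len, plus its counts.

    Ties keep the k-mer seen first; returns (None, empty Counter) if no
    k-mer produces a fragment of target_len.
    """
    best_kmer, best_count, best_counts = None, 0, Counter()
    for kmer, pos in kmer_positions(seq, k).items():
        counts = fragment_length_counts(pos, span)
        if counts[target_len] > best_count:
            best_kmer, best_count, best_counts = kmer, counts[target_len], counts
    return best_kmer, best_counts
```

Because the returned counts come from a single splitting operation, the fragments do not overlap one another, so multiplying each fragment length by its count gives a rough estimate of the sequence covered at that length. It remains an approximation: fragments interrupted by a mutated k-mer copy are reported at a longer or shorter length than the true repeat unit.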