Open ptranvan opened 5 years ago
i think it can typically be a pretty slow process
one way around it would be to go in with a small subset of reads, and then increase it as needed. the patterns that you're looking for could be present in as little as 50,000 reads, but if that doesn't work up it to 250,000, which will really speed up the grep search
hope that helps!
Thanks @cryancampbell, I have actually wrote a python script and it worked (less than a minute).
Now I am looking at the the graph made by your R script (cent_seek_graph.pdf) but I am not sure what should I see.
Do you have an example please ?
this is older work, so i don't have a particularly strong example (aside from the one in Figure 5 of our paper - https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0439-6)
here's a test file that i've marked up to show the patterns that i would investigate (attached) cent_seek_graph_wNotes.pdf
the green circles could show a 17bp and then a 34bp repeat (with the true repeat being 17, but sometimes it is flagged at 2x17 = 34) orange would be 48 & 96ish pink 60 & 120ish
if you see anything like that you can pull those hits out from trf and compare to see if there is an underlying pattern
hope that helps!
1)
About the figure 5: there is something that confused me:
The Y-axis is labelled as "Repeat length". If I understood well, it's the number of time the monomer is repeated right ? (Copy number in trf).
2)
About this sentence (in your paper):
The resulting TRF output was mined using custom scripts (https://github.com/cryancampbell/ centromere_seeker) to extract all repeats having a minimum length of 10 bp, minimum tandem repeat unit of 4, and a minimum percent similarity of 70% (across the core monomer).
What does that mean:
a) minimum tandem ? is it the monomer ?
b) minimum length ? is it the total repeat length (so in that case need at least 3 tandem repeat of 4bp)
c) minimum percent similarity ? is it the Percent of matches between adjacent copies overall.
given by trf ? (https://tandem.bu.edu/trf/trf.definitions.html#table)
1) yes, that is correct, "Number of Repeats" is probably a more clear y-axis label
2) a) would be the "Number of Repeats" (y-axis on the graph) b) the core monomer length (x-axis on the graph) c) yes, the trf value of percent matching
basically, as a filter, we looked at repeats that had a core monomer of 10bp and had at least 4x repetitions (so a minimum of 40bp of sequence data)
Thanks @cryancampbell, I have actually wrote a python script and it worked (less than a minute). @ptranvan I have the same question now. Seven days only converted 10% of my reads. Could you let me know what the python script you have used for it? Thank you very much! Now I am looking at the the graph made by your R script (cent_seek_graph.pdf) but I am not sure what should I see.
Do you have an example please ?
Hello,
I am trying your script version 2.1 and I am at the step
converting .dat file into searchable .csv
, sotrf
is done.But it looks like it is extremely slow.
I have almost 5 millions reads to analyse :
There are only 3000 sequences that have been converted to .csv in 3 days :
Is is something you faced before ? and is there any way to speed up the process ?
Thanks for your help.