cryancampbell / centromere_seeker

bash script to search for centromeric repeat patterns in PacBio data, using several current tools (trf and R)
12 stars 7 forks source link

converting trf results to csv slow ? #1

Open ptranvan opened 5 years ago

ptranvan commented 5 years ago

Hello,

I am trying your script version 2.1 and I am at the step converting .dat file into searchable .csv, so trf is done.

But it looks like it is extremely slow.

I have almost 5 millions reads to analyse :

grep 'Sequence' input.fasta.2.5.7.80.10.50.2000.dat -c
4878463

There are only 3000 sequences that have been converted to .csv in 3 days :

awk -F, '{print $1}' trf_all_hits.csv | sort | uniq | wc -l
3086

Is is something you faced before ? and is there any way to speed up the process ?

Thanks for your help.

cryancampbell commented 5 years ago

i think it can typically be a pretty slow process

one way around it would be to go in with a small subset of reads, and then increase it as needed. the patterns that you're looking for could be present in as little as 50,000 reads, but if that doesn't work up it to 250,000, which will really speed up the grep search

hope that helps!

ptranvan commented 5 years ago

Thanks @cryancampbell, I have actually wrote a python script and it worked (less than a minute).

Now I am looking at the the graph made by your R script (cent_seek_graph.pdf) but I am not sure what should I see.

Do you have an example please ?

cryancampbell commented 5 years ago

this is older work, so i don't have a particularly strong example (aside from the one in Figure 5 of our paper - https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0439-6)

here's a test file that i've marked up to show the patterns that i would investigate (attached) cent_seek_graph_wNotes.pdf

the green circles could show a 17bp and then a 34bp repeat (with the true repeat being 17, but sometimes it is flagged at 2x17 = 34) orange would be 48 & 96ish pink 60 & 120ish

if you see anything like that you can pull those hits out from trf and compare to see if there is an underlying pattern

hope that helps!

ptranvan commented 5 years ago

1)

About the figure 5: there is something that confused me:

The Y-axis is labelled as "Repeat length". If I understood well, it's the number of time the monomer is repeated right ? (Copy number in trf).

2)

About this sentence (in your paper):

The resulting TRF output was mined using custom scripts (https://github.com/cryancampbell/ centromere_seeker) to extract all repeats having a minimum length of 10 bp, minimum tandem repeat unit of 4, and a minimum percent similarity of 70% (across the core monomer).

What does that mean:

a) minimum tandem ? is it the monomer ?

b) minimum length ? is it the total repeat length (so in that case need at least 3 tandem repeat of 4bp)

c) minimum percent similarity ? is it the Percent of matches between adjacent copies overall. given by trf ? (https://tandem.bu.edu/trf/trf.definitions.html#table)

cryancampbell commented 5 years ago

1) yes, that is correct, "Number of Repeats" is probably a more clear y-axis label

2) a) would be the "Number of Repeats" (y-axis on the graph) b) the core monomer length (x-axis on the graph) c) yes, the trf value of percent matching

basically, as a filter, we looked at repeats that had a core monomer of 10bp and had at least 4x repetitions (so a minimum of 40bp of sequence data)

liwang0307 commented 2 years ago

Thanks @cryancampbell, I have actually wrote a python script and it worked (less than a minute). @ptranvan I have the same question now. Seven days only converted 10% of my reads. Could you let me know what the python script you have used for it? Thank you very much! Now I am looking at the the graph made by your R script (cent_seek_graph.pdf) but I am not sure what should I see.

Do you have an example please ?