celphin / RepeatOBserverV1

An R package to visualize chromosome scale repeat patterns and predict centromere locations.
https://www.biorxiv.org/content/10.1101/2023.12.30.573697v1

Small typo in post-repeats.sh when creating Centromere_summary_Shannon.txt and question to output files #8

Open lpettrich opened 2 months ago

lpettrich commented 2 months ago

Hi! First of all: Thank you for your tool - it is great! 😄

I noticed a small typo and just wanted to point it out. I tried to create the subfiles:

grep "cent25 " Centromere_summary_Shannon.txt > ${SPP_Hap}_Centromere_summary_Shannon_25.txt
grep "cent100 " Centromere_summary_Shannon.txt > ${SPP_Hap}_Centromere_summary_Shannon_100.txt
grep "cent250 " Centromere_summary_Shannon.txt > ${SPP_Hap}_Centromere_summary_Shannon_250.txt
grep "cent500 " Centromere_summary_Shannon.txt > ${SPP_Hap}_Centromere_summary_Shannon_500.txt
grep "cent1000 " Centromere_summary_Shannon.txt > ${SPP_Hap}_Centromere_summary_Shannon_1000.txt
grep "centwind " Centromere_summary_Shannon_35_no_telo.txt > ${SPP_Hap}_Centromere_summary_Shannon_wind_35_no_telo.txt

But this did not work. I checked your script post-repeats.sh and saw that the file is saved as ${SPP_Hap}_Centromere_summary_Shannon.txt

So to create the subfiles you would need to use:

grep "cent25 " ${SPP_Hap}_Centromere_summary_Shannon.txt > ${SPP_Hap}_Centromere_summary_Shannon_25.txt
grep "cent100 " ${SPP_Hap}_Centromere_summary_Shannon.txt > ${SPP_Hap}_Centromere_summary_Shannon_100.txt
grep "cent250 " ${SPP_Hap}_Centromere_summary_Shannon.txt > ${SPP_Hap}_Centromere_summary_Shannon_250.txt
grep "cent500 " ${SPP_Hap}_Centromere_summary_Shannon.txt > ${SPP_Hap}_Centromere_summary_Shannon_500.txt
grep "cent1000 " ${SPP_Hap}_Centromere_summary_Shannon.txt > ${SPP_Hap}_Centromere_summary_Shannon_1000.txt
grep "centwind " ${SPP_Hap}_Centromere_summary_Shannon.txt > ${SPP_Hap}_Centromere_summary_Shannon_wind_35_no_telo.txt
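Since the six corrected commands differ only in the window label, they could also be written as a loop. The following is a minimal sketch, not the code from post-repeats.sh: the function name split_summary is hypothetical, and it assumes it is run in the directory that holds the combined summary file.

```shell
#!/usr/bin/env bash
# Sketch: split the combined summary into one file per rolling-mean window size.
# split_summary is a hypothetical helper name, not part of RepeatOBserverV1.

split_summary() {
  local spp_hap="$1"
  local summary="${spp_hap}_Centromere_summary_Shannon.txt"
  local n

  # The trailing space in the pattern keeps "cent100 " from matching "cent1000 "
  # (and "cent25 " from matching "cent250 ").
  for n in 25 100 250 500 1000; do
    grep "cent${n} " "$summary" > "${spp_hap}_Centromere_summary_Shannon_${n}.txt"
  done

  # The per-window rows go to the file name used in the commands above.
  grep "centwind " "$summary" > "${spp_hap}_Centromere_summary_Shannon_wind_35_no_telo.txt"
}

# Run only if SPP_Hap is set and the combined summary exists in this directory.
if [ -n "${SPP_Hap:-}" ] && [ -f "${SPP_Hap}_Centromere_summary_Shannon.txt" ]; then
  split_summary "$SPP_Hap"
fi
```

Each output file is overwritten on every run, exactly as with the individual grep commands.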

It is just something small I caught while using your script. But it is already great that you provide all these scripts.

Additionally, I also have a question regarding the output files: The number in the file extension gives the number of windows, correct?

I think I am a bit confused about how these files relate to the directories in Summary_output/Shannon_div. Could you perhaps explain it to me? So which txt-file relates to which directory? And how did you define the cut-off for the telomeres?

Thank you!

Cheers, Laura

celphin commented 3 weeks ago

Hi Laura,

Sorry for the slow response and thanks for pointing out the typo! I have fixed the script now.

As for your questions:

>The number in the file extension gives the number of windows, correct?

Yes, the number in the file extension is the number of 5000 bp windows that the rolling mean is calculated over.

>I think I am a bit confused how these files relate to the directories in Summary_output/Shannon_div. Could you perhaps explain it to me? So which txt-file relates to which directory?

The text file is named based on the number of 5kbp windows and the folders are named by the number of base pairs. Sorry, I can see how this could be confusing. Thus the folders and files should align as such:

Text file                                                   Folder
SPP_H0-AT_Centromere_summary_Shannon_1000.txt               Shannon_div_5Mbp
SPP_H0-AT_Centromere_summary_Shannon_500.txt                Shannon_div_2.5Mbp
SPP_H0-AT_Centromere_summary_Shannon_250.txt                Shannon_div_1.25Mbp
SPP_H0-AT_Centromere_summary_Shannon_100.txt                Shannon_div_500kbp
SPP_H0-AT_Centromere_summary_Shannon_25.txt                 NA
NA (this is plotting every window)                          Shannon_div_5kbp
SPP_H0-AT_Centromere_summary_Shannon_wind_35_no_telo.txt    Shannon_div_window
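The correspondence between the number in the file name and the folder name is the window count multiplied by the 5000 bp window size. As a rough sketch of that arithmetic (the helper name span_label is hypothetical and not part of the package):

```shell
#!/usr/bin/env bash
# Sketch: convert a rolling-mean window count into the genomic span it covers,
# assuming the 5000 bp window size stated above.

WINDOW_BP=5000

span_label() {
  local windows="$1"
  local bp=$(( windows * WINDOW_BP ))
  if [ "$bp" -ge 1000000 ]; then
    # Megabase values may be fractional (e.g. 2.5), so let awk do the division.
    awk -v bp="$bp" 'BEGIN { printf "%gMbp\n", bp / 1000000 }'
  else
    echo "$(( bp / 1000 ))kbp"
  fi
}

for w in 1000 500 250 100 25; do
  echo "${w} windows -> $(span_label "$w")"
done
```

For 1000, 500, 250 and 100 windows this gives 5Mbp, 2.5Mbp, 1.25Mbp and 500kbp, matching the folder names. For 25 windows it gives 125kbp, for which there is no folder.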

>And how did you define the cut-off for the telomeres?

The telomeres are not cut off in the Shannon diversity plots, but they are cut off in the histograms, where the first and last 2Mbp bins are not plotted.

Best wishes, Cassandra