Closed crazyhottommy closed 5 years ago
No clue! I can debut things if you send me (small) files that reproduce the issue and your exact command line.
Thanks for offering help.
The files can be downloaded at https://drive.google.com/open?id=1evrSNYQkOJYlIf4J9HiDa4WRnHDEhLly
command I used:
perl TE-analysis_Shuffle_bed.pl -f CON_peaks_clean.bed -q LTR_clean.bed -s bed -r hg38_chromSize.txt -e hg38_UCSC_gap.bed
Hi, did you get a chance to look at it? Thanks!
Hi,
You sent me a bed file for the RM output, but the code expects the original RM output, to extract class, family, repeat name.
The usage states:
-q,--query => (STRING) Features to shuffle = TE file For now, can only be the repeat masker .out or the .bed file generated by the TE-analysis_pipeline script
If you have a cleaned TE annotation you could 'hack this' by creating a bed file that has the expected structure:
chr1 3000001 3000097 14737;8.1;1.0;0.2;chr1;3000001;3000097;(192471874);-;L1MdFanc_I;LINE/L1;(2987);3586;3489;1 . -
First 3 columns are the coordinates of the TE in the genome; 4th is all the RM fields separated by ";", 5th is a "." and 6th is the strand. The only RM fields you need are the ones in bold here: 14737;8.1;1.0;0.2;chr1;3000001;3000097;(192471874);-;L1MdFanc_I;LINE/L1;(2987);3586;3489;1 - so you could make something like:
chr1 3000001 3000097 0;0;0;0;chr1;3000001;3000097;na;na;L1MdFanc_I;LINE/L1;na;0;0;0 . -
and the code will run.
Using this code does not make sense if you don't have repeat names, class and family in the TE file. If you want to test overlaps and do stats without caring for these info, you could check out Giggle, or regioneR package in R...
Another side note, even though it is implemented, the -s bed option is very unlikely to reflect a good expected distribution of TEs in the genome (since they are not randomly distributed in the first place). That is why I implemented -s rm and -s tss, which test slightly different things.
Thanks @4ureliek these are all good information. I will re-do the analysis.
BTW, repeat masker .out file is the repeatmask file from UCSC?
I would use the one from the Repeat Masker website, they remask organisms with updated libraries (not always the case on UCSC)
Thanks a billion!
I prepared a file like this LTR.out
chr1 10000 10468 0;0;0;0;chr1;10000;10468;na;na;Simple_repeat;Simple_repeatna;0;0;0 . +
chr1 21948 22075 0;0;0;0;chr1;21948;22075;na;na;LTR;ERVL-MaLRna;0;0;0 . +
chr1 30693 30848 0;0;0;0;chr1;30693;30848;na;na;LTR;ERVL-MaLRna;0;0;0 . +
chr1 30952 31131 0;0;0;0;chr1;30952;31131;na;na;LTR;ERVL-MaLRna;0;0;0 . +
chr1 34564 34921 0;0;0;0;chr1;34564;34921;na;na;LTR;ERVL-MaLRna;0;0;0 . -
chr1 38255 39464 0;0;0;0;chr1;38255;39464;na;na;LTR;ERVL-MaLRna;0;0;0 . +
chr1 39464 39623 0;0;0;0;chr1;39464;39623;na;na;LTR;ERVL-MaLRna;0;0;0 . +
chr1 39924 40294 0;0;0;0;chr1;39924;40294;na;na;LTR;ERVL-MaLRna;0;0;0 . +
chr1 40735 40878 0;0;0;0;chr1;40735;40878;na;na;LTR;ERVLna;0;0;0 . -
chr1 41393 42273 0;0;0;0;chr1;41393;42273;na;na;LTR;ERVLna;0;0;0 . -
perl TE-analysis_Shuffle_bed.pl -f CON_peaks_clean.bed -q LTR.out -s bed -r hg38_chromSize.txt -e hg38_UCSC_gap.bed
still empty result file.
Hi,
Ha, you need to name this file .bed
or the code is trying to make a bed file (named LTR.nonTE-no_low.bed
) using the ".out" file you provided (but it's not a .out file!). Thus, that created bed file is empty, thus the output is empty. I should add a warning when that happens (but this is a special case).
Name your repeat file LTR.bed instead (since it's a bed file!) and run:
perl TE-analysis_Shuffle_bed.pl -f CON_peaks_clean.bed -q LTR.bed -s bed -r hg38_chromSize.txt -e hg38_UCSC_gap.bed
It gave me data. There were some uninitialized value errors, but I'm expected they disappear when you have more data. Let me know if not!
~A
thanks again!
it now gives me some output. how to read it and what is tot?
#Level_(tot_means_all) # # #COUNTS # # # # # # # # # # # # #
#Rclass Rfam Rname obs_hits %_obs_(%of_features) obs_tot_hits nb_of_trials(nb_of_TE_in_genome) exp_avg_hits exp_sd %_exp_(%of_features) exp_tot_hits(avg) obs_rank_in_exp 2-tailed_permutation-test_pvalue(obs.vs.exp) significance binomal_test_proba binomial_test_95%_confidence_interval binomial_test_pval significance
LTRna tot tot 195 0.300772754615705 104529 3225 313.89 15.0808561501734 0.484151589468326 75570.39 1 0.02 * 0.06046512 0.05248566;0.06925266 5.948e-14 ***
LTRna LTRna LTR 195 0.300772754615705 104529 3225 313.89 15.0808561501734 0.484151589468326 75570.39 1 0.02 * 0.06046512 0.05248566;0.06925266 5.948e-14 ***
LTRna LTRna tot 195 0.300772754615705 104529 3225 313.89 15.0808561501734 0.484151589468326 75570.39 1 0.02 * 0.06046512 0.05248566;0.06925266 5.948e-14 ***
ERVLna tot tot 16535 25.5039871670291 104529 167168 16815.12 128.488908455009 25.9360510850956 75570.39 3 0.06 ns 0.09891247 0.09748531;0.10035356 0.0228 *
ERVLna ERVLna LTR 16535 25.5039871670291 104529 167168 16815.12 128.488908455009 25.9360510850956 75570.39 3 0.06 ns 0.09891247 0.09748531;0.10035356 0.0228 *
ERVLna ERVLna tot 16535 25.5039871670291 104529 167168 16815.12 128.488908455009 25.9360510850956 75570.39 3 0.06 ns 0.09891247 0.09748531;0.10035356 0.0228 *
I also downloaded the .out file
wget http://www.repeatmasker.org/genomes/hg38/RepeatMasker-rm405-db20140131/hg38.fa.out.gz .
less -S hg38.fa.out | awk 'NR >3' | grep LTR > LTR.out
perl TE-analysis_Shuffle_bed.pl -f CON_peaks_clean.bed -q LTR.out -n 10 -s bed -r hg38_chromSize.txt -e hg38_UCSC_gap.bed
but the LTR.nonTE-no_low.bed
is also empty.
Hi, To read the output:
Hope this helps, Aurelie
Hi, I ran exactly your command line to get the RM output file and the LTR.out file. Deleted any previous bedfile from my directory with your test files in it. Ran the command line you shared, and got a non-empty LTR.nonTE-no_low.bed file. Make sure you delete it if it existed, as the code will check for existence and won't regenerate it.
Thanks very much!
Hi, after fixing the missing ; I got some sensible results
For LTR:
#Level_(tot_means_all) # # #COUNTS # # # # # # # # # #
#Rclass Rfam Rname obs_hits %_obs_(%of_features) obs_tot_hits nb_of_trials(nb_of_TE_in_genome) exp_avg_hits exp_sd %_exp_(%of_features) exp_tot_hits(avg) obs_rank_in_exp 2-tailed_permutation-test_pvalue(obs.vs.exp) significance binomal_test_proba binomial_test_95%_confidence_interval binomial_test_pval significance
LTR LTR tot 195 0.300772754615705 104529 3225 313.16 15.3456603940428 0.483025619668996 75557.72 1 0.02 * 0.06046512 0.05248566;0.06925266 8.748e-14 ***
LTR LTR LTR 195 0.300772754615705 104529 3225 313.16 15.3456603940428 0.483025619668996 75557.72 1 0.02 * 0.06046512 0.05248566;0.06925266 8.748e-14 ***
LTR tot tot 195 0.300772754615705 104529 3225 313.16 15.3456603940428 0.483025619668996 75557.72 1 0.02 * 0.06046512 0.05248566;0.06925266 8.748e-14 ***
tot tot tot 104529 161.228078293462 104529 748597 75557.72 266.548726949471 116.542069625037 75557.72 101 0.02 * 0.1396332 0.1388488;0.1404204 2.2e-16 ***
ERVK ERVK tot 4040 6.23139450588435 104529 11411 1267.31 32.4307171388703 1.95472984436938 75557.72 101 0.02 * 0.3540443 0.3452643;0.3628989 2.2e-16 ***
ERVK ERVK LTR 4040 6.23139450588435 104529 11411 1267.31 32.4307171388703 1.95472984436938 75557.72 101 0.02 * 0.3540443 0.3452643;0.3628989 2.2e-16 ***
ERVK tot tot 4040 6.23139450588435 104529 11411 1267.31 32.4307171388703 1.95472984436938 755
For LINE:
#Level_(tot_means_all) # # #COUNTS # # # # # # # # # #
#Rclass Rfam Rname obs_hits %_obs_(%of_features) obs_tot_hits nb_of_trials(nb_of_TE_in_genome) exp_avg_hits exp_sd %_exp_(%of_features) exp_tot_hits(avg) obs_rank_in_exp 2-tailed_permutation-test_pvalue(obs.vs.exp) significance binomal_test_proba binomial_test_95%_confidence_interval binomial_test_pval significance
RTE-BovB RTE-BovB tot 447 0.689463699042154 201095 9120 882.9 25.6083556439667 1.36180648743695 161135.72 1 0.02 * 0.04901316 0.04467207;0.05364511 2.2e-16 ***
RTE-BovB RTE-BovB LINE 447 0.689463699042154 201095 9120 882.9 25.6083556439667 1.36180648743695 161135.72 1 0.02 * 0.04901316 0.04467207;0.05364511 2.2e-16 ***
RTE-BovB tot tot 447 0.689463699042154 201095 9120 882.9 25.6083556439667 1.36180648743695 161135.72 1 0.02 * 0.04901316 0.04467207;0.05364511 2.2e-16 ***
age cat.1 tot 201095 310.173831227924 201095 161135.72 358.652643637097 248.539663443 161135.72 101 0.02 * na na na na
age cat.1 201095 310.173831227924 201095 161135.72 358.652643637097 248.539663443 161135.72 101 0.02 * na na na na
age cat.2 tot 201095 310.173831227924 201095 161135.72 358.652643637097 248.539663443 161135.72 101 0.02 * na na na na
CR1 CR1 tot 3539 5.45863989017938 201095 68182 6645.1 73.1270514691086 10.2495642651119 161135.72 1 0.02 * 0.05190519 0.05025184;0.05359681 2.2e-16 ***
CR1 CR1 LINE 3539 5.45863989017938 201095 68182 6645.1 73.1270514691086 10.2495642651119 161135.72 1 0.02 * 0.05190519 0.05025184;0.05359681 2.2e-16 ***
CR1 tot tot 3539 5.45863989017938 201095 68182 6645.1 73.1270514691086 10.2495642651119 161135.72 1 0.02 * 0.05190519 0.05025184;0.05359681 2.2e-16 ***
L1 L1 LINE 168262 259.531411472553 201095 1001410 104706.36 259.905636333612 161.501642681968 161135
I am still a bit confused about the tot
in the first three columns. If I want to know if LTR and LINE is enriched or not in my features. which rows should I check?
basically, I want to make a bar graph like below in
Hi, There is something wrong in your TE names though; you have L1 as the class and family but LINE at the name. That might create weird behavior...
For the columns, they should look like something like this:
#Rclass Rfam Rname
LINE L1 L1Hs
means all the L1Hs fragments
LINE L1 tot
means all the L1 fragments
LINE tot tot
means all the LINE fragments
Hi, I was trying this script, it prints the steps on screen but the output file empty with only the headers:
any idea what's wrong?
Thanks!