TE-analysis_Shuffle_bed.pl_V4.4 output empty

crazyhottommy commented 5 years ago

Hi, I was trying this script, it prints the steps on screen but the output file empty with only the headers:

#Script TE-analysis_Shuffle_bed.pl, v4.4
#Aggregated results + stats
#Features in input file (counts):
        64833
#With 500 bootstraps for exp (expected); sd = standard deviation; nb = number; len = length; avg = average
#Two tests are made (permutation and binomial) to assess how significant the difference between observed and ran
dom, so two pvalues are given
#For the two tailed permutation test:
#if rank is < 250 and pvalue is not "ns", there are significantly fewer observed values than expected 
#if rank is > 250 and pvalue is not "ns", there are significantly higher observed values than expected 
#The binomial test is done with binom.test from R, two sided

#Level_(tot_means_all)  #       #       #COUNTS #       #       #       #       #       #       #       #      #
        #       #       #       #
#Rclass Rfam    Rname   obs_hits        %_obs_(%of_features)    obs_tot_hits    nb_of_trials(nb_of_TE_in_genome)
        exp_avg_hits    exp_sd  %_exp_(%of_features)    exp_tot_hits(avg)       obs_rank_in_exp 2-tailed_permuta
tion-test_pvalue(obs.vs.exp)    significance    binomal_test_proba      binomial_test_95%_confidence_interval  b
inomial_test_pval       significance

any idea what's wrong?

Thanks!

4ureliek commented 5 years ago

No clue! I can debut things if you send me (small) files that reproduce the issue and your exact command line.

crazyhottommy commented 5 years ago

Thanks for offering help.

The files can be downloaded at https://drive.google.com/open?id=1evrSNYQkOJYlIf4J9HiDa4WRnHDEhLly

command I used:

perl TE-analysis_Shuffle_bed.pl -f CON_peaks_clean.bed -q LTR_clean.bed -s bed -r hg38_chromSize.txt -e hg38_UCSC_gap.bed

crazyhottommy commented 5 years ago

Hi, did you get a chance to look at it? Thanks!

4ureliek commented 5 years ago

Hi,

You sent me a bed file for the RM output, but the code expects the original RM output, to extract class, family, repeat name. The usage states: -q,--query => (STRING) Features to shuffle = TE file For now, can only be the repeat masker .out or the .bed file generated by the TE-analysis_pipeline script If you have a cleaned TE annotation you could 'hack this' by creating a bed file that has the expected structure: chr1 3000001 3000097 14737;8.1;1.0;0.2;chr1;3000001;3000097;(192471874);-;L1MdFanc_I;LINE/L1;(2987);3586;3489;1 . - First 3 columns are the coordinates of the TE in the genome; 4th is all the RM fields separated by ";", 5th is a "." and 6th is the strand. The only RM fields you need are the ones in bold here: 14737;8.1;1.0;0.2;chr1;3000001;3000097;(192471874);-;L1MdFanc_I;LINE/L1;(2987);3586;3489;1 - so you could make something like: chr1 3000001 3000097 0;0;0;0;chr1;3000001;3000097;na;na;L1MdFanc_I;LINE/L1;na;0;0;0 . - and the code will run.

Using this code does not make sense if you don't have repeat names, class and family in the TE file. If you want to test overlaps and do stats without caring for these info, you could check out Giggle, or regioneR package in R...

Another side note, even though it is implemented, the -s bed option is very unlikely to reflect a good expected distribution of TEs in the genome (since they are not randomly distributed in the first place). That is why I implemented -s rm and -s tss, which test slightly different things.

crazyhottommy commented 5 years ago

Thanks @4ureliek these are all good information. I will re-do the analysis.

crazyhottommy commented 5 years ago

BTW, repeat masker .out file is the repeatmask file from UCSC?

4ureliek commented 5 years ago

I would use the one from the Repeat Masker website, they remask organisms with updated libraries (not always the case on UCSC)

crazyhottommy commented 5 years ago

Thanks a billion!

crazyhottommy commented 5 years ago

I prepared a file like this LTR.out

chr1    10000   10468   0;0;0;0;chr1;10000;10468;na;na;Simple_repeat;Simple_repeatna;0;0;0      .       +
chr1    21948   22075   0;0;0;0;chr1;21948;22075;na;na;LTR;ERVL-MaLRna;0;0;0    .       +
chr1    30693   30848   0;0;0;0;chr1;30693;30848;na;na;LTR;ERVL-MaLRna;0;0;0    .       +
chr1    30952   31131   0;0;0;0;chr1;30952;31131;na;na;LTR;ERVL-MaLRna;0;0;0    .       +
chr1    34564   34921   0;0;0;0;chr1;34564;34921;na;na;LTR;ERVL-MaLRna;0;0;0    .       -
chr1    38255   39464   0;0;0;0;chr1;38255;39464;na;na;LTR;ERVL-MaLRna;0;0;0    .       +
chr1    39464   39623   0;0;0;0;chr1;39464;39623;na;na;LTR;ERVL-MaLRna;0;0;0    .       +
chr1    39924   40294   0;0;0;0;chr1;39924;40294;na;na;LTR;ERVL-MaLRna;0;0;0    .       +
chr1    40735   40878   0;0;0;0;chr1;40735;40878;na;na;LTR;ERVLna;0;0;0 .       -
chr1    41393   42273   0;0;0;0;chr1;41393;42273;na;na;LTR;ERVLna;0;0;0 .       -

perl TE-analysis_Shuffle_bed.pl -f CON_peaks_clean.bed -q LTR.out -s bed -r hg38_chromSize.txt -e hg38_UCSC_gap.bed

still empty result file.

4ureliek commented 5 years ago

Hi,

Ha, you need to name this file .bed or the code is trying to make a bed file (named LTR.nonTE-no_low.bed) using the ".out" file you provided (but it's not a .out file!). Thus, that created bed file is empty, thus the output is empty. I should add a warning when that happens (but this is a special case).

Name your repeat file LTR.bed instead (since it's a bed file!) and run: perl TE-analysis_Shuffle_bed.pl -f CON_peaks_clean.bed -q LTR.bed -s bed -r hg38_chromSize.txt -e hg38_UCSC_gap.bed

It gave me data. There were some uninitialized value errors, but I'm expected they disappear when you have more data. Let me know if not!

~A

crazyhottommy commented 5 years ago

thanks again!

crazyhottommy commented 5 years ago

it now gives me some output. how to read it and what is tot?

#Level_(tot_means_all)  #   #   #COUNTS #   #   #   #   #   #   #   #   #   #   #   #   #
#Rclass Rfam    Rname   obs_hits    %_obs_(%of_features)    obs_tot_hits    nb_of_trials(nb_of_TE_in_genome)    exp_avg_hits    exp_sd  %_exp_(%of_features)    exp_tot_hits(avg)   obs_rank_in_exp 2-tailed_permutation-test_pvalue(obs.vs.exp)    significance    binomal_test_proba  binomial_test_95%_confidence_interval   binomial_test_pval  significance

LTRna   tot tot 195 0.300772754615705   104529  3225    313.89  15.0808561501734    0.484151589468326   75570.39    1   0.02    *   0.06046512  0.05248566;0.06925266   5.948e-14   ***
LTRna   LTRna   LTR 195 0.300772754615705   104529  3225    313.89  15.0808561501734    0.484151589468326   75570.39    1   0.02    *   0.06046512  0.05248566;0.06925266   5.948e-14   ***
LTRna   LTRna   tot 195 0.300772754615705   104529  3225    313.89  15.0808561501734    0.484151589468326   75570.39    1   0.02    *   0.06046512  0.05248566;0.06925266   5.948e-14   ***
ERVLna  tot tot 16535   25.5039871670291    104529  167168  16815.12    128.488908455009    25.9360510850956    75570.39    3   0.06    ns  0.09891247  0.09748531;0.10035356   0.0228  *
ERVLna  ERVLna  LTR 16535   25.5039871670291    104529  167168  16815.12    128.488908455009    25.9360510850956    75570.39    3   0.06    ns  0.09891247  0.09748531;0.10035356   0.0228  *
ERVLna  ERVLna  tot 16535   25.5039871670291    104529  167168  16815.12    128.488908455009    25.9360510850956    75570.39    3   0.06    ns  0.09891247  0.09748531;0.10035356   0.0228  *

crazyhottommy commented 5 years ago

I also downloaded the .out file

wget http://www.repeatmasker.org/genomes/hg38/RepeatMasker-rm405-db20140131/hg38.fa.out.gz .

less -S hg38.fa.out | awk 'NR >3' | grep LTR > LTR.out

perl TE-analysis_Shuffle_bed.pl -f CON_peaks_clean.bed -q LTR.out -n 10 -s bed -r hg38_chromSize.txt -e hg38_UCSC_gap.bed

but the LTR.nonTE-no_low.bed is also empty.

4ureliek commented 5 years ago

Hi, To read the output:

first I see that there are "na" attached to the repeat class, might be a ";" missing in the bed file
obs means "observed"
exp means "expected" (based on the type of shuffling that was done)
you can sort on rank (obs_rank_in_exp), and filter on the significance column (the one right after "2-tailed_permutation-test_pvalue(obs.vs.exp)"): keep the ones with stars. The rank of the permutation indicates very well how enriched or not things are (the header of the output has explanations on how to read the rank; 1 means that it is depleted). The significance of the permutation test requires high bootstraps to be useful, but is the best one to use. I added the binomial test that is more sensitive, but tends to give some false positives for low counts.
the 'tot" values are to help seeing if a class is enriched, but unless a specific family is, I would not report it.
note that TEs are generally depleted from any feature, which makes it difficult sometimes to see any enrichment with these shuffling methods.

Hope this helps, Aurelie

4ureliek commented 5 years ago

Hi, I ran exactly your command line to get the RM output file and the LTR.out file. Deleted any previous bedfile from my directory with your test files in it. Ran the command line you shared, and got a non-empty LTR.nonTE-no_low.bed file. Make sure you delete it if it existed, as the code will check for existence and won't regenerate it.

crazyhottommy commented 5 years ago

Thanks very much!

crazyhottommy commented 5 years ago

Hi, after fixing the missing ; I got some sensible results

For LTR:

#Level_(tot_means_all)  #   #   #COUNTS #   #   #   #   #   #   #   #   #   #
#Rclass Rfam    Rname   obs_hits    %_obs_(%of_features)    obs_tot_hits    nb_of_trials(nb_of_TE_in_genome)    exp_avg_hits    exp_sd  %_exp_(%of_features)    exp_tot_hits(avg)   obs_rank_in_exp 2-tailed_permutation-test_pvalue(obs.vs.exp)    significance    binomal_test_proba  binomial_test_95%_confidence_interval   binomial_test_pval  significance

LTR LTR tot 195 0.300772754615705   104529  3225    313.16  15.3456603940428    0.483025619668996   75557.72    1   0.02    *   0.06046512  0.05248566;0.06925266   8.748e-14   ***
LTR LTR LTR 195 0.300772754615705   104529  3225    313.16  15.3456603940428    0.483025619668996   75557.72    1   0.02    *   0.06046512  0.05248566;0.06925266   8.748e-14   ***
LTR tot tot 195 0.300772754615705   104529  3225    313.16  15.3456603940428    0.483025619668996   75557.72    1   0.02    *   0.06046512  0.05248566;0.06925266   8.748e-14   ***
tot tot tot 104529  161.228078293462    104529  748597  75557.72    266.548726949471    116.542069625037    75557.72    101 0.02    *   0.1396332   0.1388488;0.1404204 2.2e-16 ***
ERVK    ERVK    tot 4040    6.23139450588435    104529  11411   1267.31 32.4307171388703    1.95472984436938    75557.72    101 0.02    *   0.3540443   0.3452643;0.3628989 2.2e-16 ***
ERVK    ERVK    LTR 4040    6.23139450588435    104529  11411   1267.31 32.4307171388703    1.95472984436938    75557.72    101 0.02    *   0.3540443   0.3452643;0.3628989 2.2e-16 ***
ERVK    tot tot 4040    6.23139450588435    104529  11411   1267.31 32.4307171388703    1.95472984436938    755

For LINE:

#Level_(tot_means_all)  #   #   #COUNTS #   #   #   #   #   #   #   #   #   #
#Rclass Rfam    Rname   obs_hits    %_obs_(%of_features)    obs_tot_hits    nb_of_trials(nb_of_TE_in_genome)    exp_avg_hits    exp_sd  %_exp_(%of_features)    exp_tot_hits(avg)   obs_rank_in_exp 2-tailed_permutation-test_pvalue(obs.vs.exp)    significance    binomal_test_proba  binomial_test_95%_confidence_interval   binomial_test_pval  significance

RTE-BovB    RTE-BovB    tot 447 0.689463699042154   201095  9120    882.9   25.6083556439667    1.36180648743695    161135.72   1   0.02    *   0.04901316  0.04467207;0.05364511   2.2e-16 ***
RTE-BovB    RTE-BovB    LINE    447 0.689463699042154   201095  9120    882.9   25.6083556439667    1.36180648743695    161135.72   1   0.02    *   0.04901316  0.04467207;0.05364511   2.2e-16 ***
RTE-BovB    tot tot 447 0.689463699042154   201095  9120    882.9   25.6083556439667    1.36180648743695    161135.72   1   0.02    *   0.04901316  0.04467207;0.05364511   2.2e-16 ***
age cat.1   tot 201095  310.173831227924    201095      161135.72   358.652643637097    248.539663443   161135.72   101 0.02    *   na  na  na  na
age cat.1       201095  310.173831227924    201095      161135.72   358.652643637097    248.539663443   161135.72   101 0.02    *   na  na  na  na
age cat.2   tot 201095  310.173831227924    201095      161135.72   358.652643637097    248.539663443   161135.72   101 0.02    *   na  na  na  na
CR1 CR1 tot 3539    5.45863989017938    201095  68182   6645.1  73.1270514691086    10.2495642651119    161135.72   1   0.02    *   0.05190519  0.05025184;0.05359681   2.2e-16 ***
CR1 CR1 LINE    3539    5.45863989017938    201095  68182   6645.1  73.1270514691086    10.2495642651119    161135.72   1   0.02    *   0.05190519  0.05025184;0.05359681   2.2e-16 ***
CR1 tot tot 3539    5.45863989017938    201095  68182   6645.1  73.1270514691086    10.2495642651119    161135.72   1   0.02    *   0.05190519  0.05025184;0.05359681   2.2e-16 ***
L1  L1  LINE    168262  259.531411472553    201095  1001410 104706.36   259.905636333612    161.501642681968    161135

I am still a bit confused about the tot in the first three columns. If I want to know if LTR and LINE is enriched or not in my features. which rows should I check?

basically, I want to make a bar graph like below in

https://www.ncbi.nlm.nih.gov/pubmed/28052240

4ureliek commented 5 years ago

Hi, There is something wrong in your TE names though; you have L1 as the class and family but LINE at the name. That might create weird behavior...

For the columns, they should look like something like this: #Rclass Rfam Rname LINE L1 L1Hs means all the L1Hs fragments LINE L1 tot means all the L1 fragments LINE tot tot means all the LINE fragments

4ureliek / TEanalysis

TE-analysis_Shuffle_bed.pl_V4.4 output empty #3