bvaldebenitom / SoloTE

GNU General Public License v3.0

Mm10 and Hg38 *rm.out files #1

Closed by JBreunig 1 year ago

JBreunig commented 1 year ago

I was able to find RepeatMasker files on the UCSC website, but they don't seem to be in exactly the same format or to use the *.rm.out name. Could you provide a link? Thanks in advance! Josh

bvaldebenitom commented 1 year ago

Hi @JBreunig,

I noticed you also posted this on the SpatialTE issues. Both tools ship with a script that converts the RepeatMasker *.out file into the format the pipeline requires.

For SoloTE, the script is named convertRMOut_to_SoloTEinput.sh; for SpatialTE, it is named convertRMOut_to_SpatialTEinput.sh and works the same way.

The input for both scripts is the RepeatMasker .out file. For example, the mm39.fa.out file available from UCSC looks like this:

% head mm39.fa.out 
   SW  perc perc perc  query      position in query           matching       repeat              position in  repeat
score  div. del. ins.  sequence    begin     end    (left)    repeat         class/family         begin  end (left)   ID

 3667   6.7  0.0  1.0  chr1      3050294 3050775 (192103504) +  L1MdA_VI       LINE/L1               6094 6570    (6)      1
  242  26.5  2.2  3.9  chr1      3051045 3051146 (192103133) C  ID2            SINE/ID               (48)  104      5      2
   13  11.6  0.0  6.7  chr1      3051160 3051191 (192103088) +  (TGCT)n        Simple_repeat            1   30    (0)      3
  303  28.9 11.2  3.6  chr1      3051232 3051371 (192102908) +  Tigger9a       DNA/TcMar-Tigger        88  239  (493)      4
  579  29.3 11.7  2.2  chr1      3051808 3051992 (192102287) +  Tigger5b_Glire DNA/TcMar-Tigger        39  243  (110)      5
  385  25.7  3.4  4.2  chr1      3052102 3052219 (192102060) +  PB1D10         SINE/Alu                 1  117    (0)      6
   16  10.7  0.0  3.2  chr1      3052862 3052893 (192101386) +  (TATCAA)n      Simple_repeat            1   31    (0)      7

Then, for SoloTE, run the conversion script as follows, where mm39_SoloTE.bed is the name you want for the new file:

% ./convertRMOut_to_SoloTEinput.sh mm39.fa.out mm39_SoloTE.bed

% head mm39_SoloTE.bed
chr1    3050294 3050775 chr1|3050294|3050775|L1MdA_VI:L1:LINE|+ 3667    +
chr1    3051045 3051146 chr1|3051045|3051146|ID2:ID:SINE|-  242 -
chr1    3051232 3051371 chr1|3051232|3051371|Tigger9a:TcMar-Tigger:DNA|+    303 +
chr1    3051808 3051992 chr1|3051808|3051992|Tigger5b_Glire:TcMar-Tigger:DNA|+  579 +
chr1    3052102 3052219 chr1|3052102|3052219|PB1D10:Alu:SINE|+  385 +
chr1    3053936 3054012 chr1|3053936|3054012|L1MC4:L1:LINE|-    247 -
chr1    3054039 3054829 chr1|3054039|3054829|L1MB4:L1:LINE|-    1624    -
chr1    3054917 3055526 chr1|3054917|3055526|Lx8b:L1:LINE|- 1976    -
chr1    3055671 3055792 chr1|3055671|3055792|B1F1:Alu:SINE|-    316 -
chr1    3056082 3056260 chr1|3056082|3056260|L1MB4:L1:LINE|-    316 -
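
To make the relationship between the two listings above easier to follow, here is a minimal Python sketch of the field mapping they suggest. It is not the shipped convertRMOut_to_SoloTEinput.sh script, and the function name `rmout_line_to_bed` is hypothetical. Two things are assumptions inferred only from the example output: that `Simple_repeat` records are dropped, and that the `class/family` column is reordered into `name:family:class`. The real script may filter other classes as well.

```python
# Sketch only: mirrors the input/output example in this thread,
# not the actual SoloTE conversion script.

# Assumption: Simple_repeat rows are absent from the example output,
# so they are skipped here. The real script may skip more classes.
SKIPPED_CLASSES = {"Simple_repeat"}


def rmout_line_to_bed(line):
    """Convert one RepeatMasker .out record to a tab-separated BED-like row.

    Returns None for header lines, blank lines, and skipped classes.
    """
    fields = line.split()
    # Header and blank lines do not start with a numeric SW score.
    if len(fields) < 11 or not fields[0].isdigit():
        return None

    score = fields[0]
    chrom, start, end = fields[4], fields[5], fields[6]
    # RepeatMasker marks the complementary strand with "C".
    strand = "-" if fields[8] == "C" else "+"
    name, class_family = fields[9], fields[10]

    if class_family in SKIPPED_CLASSES:
        return None

    # "LINE/L1" -> class "LINE", family "L1"; reuse the class if no family is given.
    if "/" in class_family:
        te_class, te_family = class_family.split("/", 1)
    else:
        te_class = te_family = class_family

    te_id = f"{chrom}|{start}|{end}|{name}:{te_family}:{te_class}|{strand}"
    return "\t".join([chrom, start, end, te_id, score, strand])


if __name__ == "__main__":
    sample = [
        " 3667   6.7  0.0  1.0  chr1 3050294 3050775 (192103504) +  L1MdA_VI LINE/L1 6094 6570 (6) 1",
        "  242  26.5  2.2  3.9  chr1 3051045 3051146 (192103133) C  ID2 SINE/ID (48) 104 5 2",
        "   13  11.6  0.0  6.7  chr1 3051160 3051191 (192103088) +  (TGCT)n Simple_repeat 1 30 (0) 3",
    ]
    for record in sample:
        row = rmout_line_to_bed(record)
        if row is not None:
            print(row)
```

Run on the first three records of the mm39.fa.out listing, this prints two rows that correspond to the first two lines of the mm39_SoloTE.bed listing; the `(TGCT)n` record is skipped.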

Hope this helps!

JBreunig commented 1 year ago

Thank you! Any plans to create a scATAC version of this package?