DrosophilaGenomeEvolution / TrEMOLO

Transposable Elements MOvement detection using LOng reads
GNU General Public License v3.0

grep: out/OUTSIDER/TE_DETECTION/MERGE_TE/tmp_ID_TrEMOLO.txt:1: Invalid range end #6

Closed cgroza closed 1 year ago

cgroza commented 1 year ago

Hi,

I am encountering another error, this time on a separate human dataset (not Drosophila, as in issue #5). It happens in rule MERGE_TE:

rule MERGE_TE:
    input: out/OUTSIDER/TE_DETECTION/FILTER_BLAST_SEQUENCE_INDEL_vs_DBTE.csv, out/OUTSIDER/TrEMOLO_SV_TE/SOFT/SOFT_TE.csv, out/OUTSIDER/TrEMOLO_SV_TE/INS/INS_TREMOLO.bed
    output: out/OUTSIDER/TE_DETECTION/MERGE_TE/MERGE_TE_ALL.bed, out/POSITION_TE_OUTSIDER.bed, out/tmp_TrEMOLO_output_rule/rule_tmp_MERGE_TE_out
    log: out/log/MERGE_TE
    jobid: 12

grep: out/OUTSIDER/TE_DETECTION/MERGE_TE/tmp_ID_TrEMOLO.txt:1: Invalid range end
grep: out/OUTSIDER/TE_DETECTION/MERGE_TE/tmp_ID_TrEMOLO.txt:2: Invalid range end
grep: out/OUTSIDER/TE_DETECTION/MERGE_TE/tmp_ID_TrEMOLO.txt:3: Invalid range end
grep: out/OUTSIDER/TE_DETECTION/MERGE_TE/tmp_ID_TrEMOLO.txt:4: Invalid range end
grep: out/OUTSIDER/TE_DETECTION/MERGE_TE/tmp_ID_TrEMOLO.txt:5: Invalid range end
... many more "Invalid range end"
Error in rule MERGE_TE:
    jobid: 12
    output: out/OUTSIDER/TE_DETECTION/MERGE_TE/MERGE_TE_ALL.bed, out/POSITION_TE_OUTSIDER.bed, out/tmp_TrEMOLO_output_rule/rule_tmp_MERGE_TE_out
    log: out/log/MERGE_TE (check log file(s) for error message)
    shell:

The human genome is HG002 with 30X PacBio long reads. Any advice on what could be causing this?

My thanks, Cristian Groza

M-D75 commented 1 year ago

Hi,

The cause could be that some names in the transposable element database look like malformed regular expressions (e.g. [TE-755). grep tries to interpret each name as a regular expression instead of treating it as a fixed string. I could fix this by adding the -F option, but grep is used repeatedly in the pipeline, so I need some time to check which calls should get -F, to avoid unpleasant surprises.
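A minimal sketch of the difference, not the pipeline's actual command: the file names and the TE name `TE[z-a]_1` are hypothetical, chosen so that the bracket expression is an invalid regex range. It assumes a grep that supports `-F` and `-f`, as GNU and BSD grep do.

```shell
# Hypothetical ID list and BED-like table (not TrEMOLO's real files).
tmpdir=$(mktemp -d)
printf '%s\n' 'TE[z-a]_1' > "$tmpdir/ids.txt"
printf 'chr1\t100\t200\t%s\n' 'TE[z-a]_1' >  "$tmpdir/hits.bed"
printf 'chr2\t300\t400\t%s\n' 'OTHER_TE'  >> "$tmpdir/hits.bed"

# Default mode: every line of ids.txt is parsed as a regular expression.
# "[z-a]" is not a valid range, so grep fails instead of matching.
regex_status=0
grep -f "$tmpdir/ids.txt" "$tmpdir/hits.bed" >/dev/null 2>&1 || regex_status=$?

# -F: every line of ids.txt is a literal string, so the lookup succeeds.
fixed_hits=$(grep -F -f "$tmpdir/ids.txt" "$tmpdir/hits.bed")

echo "regex mode exit status: $regex_status"
echo "fixed-string hits: $fixed_hits"

rm -rf "$tmpdir"
```

Note that `-F` also disables anchors such as `^` and `$`, which is one reason each grep call has to be checked individually.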

Just to be sure, since these are PacBio long reads: did you change the minimap2 PRESET_OPTION from map-ont to map-pb in the config.yaml file?

...

PARAMS:
    THREADS: 8 #number of threads for some task
    OUTSIDER_VARIANT:
        MINIMAP2:
            PRESET_OPTION: 'map-pb' 
...

Thanks for this report, Mourdas

cgroza commented 1 year ago

OK, I checked for regular expression special characters and replaced them with _. Indeed, it is a challenge to curate the TE database names for such cases. And yes, I switched to map-pb. Will rerun and see how it goes!
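For anyone hitting the same issue, here is a rough sketch of that renaming step. It assumes the TE database is a FASTA file whose names are on the `>` header lines. The function name `sanitize_te_names` is made up for this example, and the set of characters it keeps (letters, digits, `_`, `-`) is my own choice, not something TrEMOLO requires.

```shell
# Hypothetical helper: reads FASTA on stdin and writes FASTA on stdout.
# On header lines only, every character other than the leading ">",
# letters, digits, "_" and "-" is replaced with "_".
# Sequence lines are passed through unchanged.
sanitize_te_names() {
    sed '/^>/ s/[^>A-Za-z0-9_-]/_/g'
}

# Example with a made-up TE name containing bracket characters.
printf '>TE[z-a]_1\nACGT\n' | sanitize_te_names
```

Two different names can collapse to the same sanitized name, so it is worth checking the headers for duplicates afterwards (for example with `grep '^>' | sort | uniq -d`).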

Thanks, Cristian