TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

prevent earlGrey from modifying the original input genome file #81

Closed estolle closed 9 months ago

estolle commented 9 months ago

Hi there,

I just found these lines in the early stages of earlGrey, where it prepares the input genome:

        cp ${genome} ${genome}.bak && gzip -f ${genome}.bak 
        sed -i '/>/ s/ .*//g; /^$/d' ${genome}

I was not aware that the original file is being modified. I think this behaviour is dangerous, as the user doesn't really notice it. I would also like to avoid making too many copies of the input. Modifying this behaviour would be better, i.e. modify a copy and carry that copy through the next steps until the genome.prep is produced.
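For illustration, here is a minimal sketch of the difference between the in-place edit and working on a copy (the toy genome.fa filename is hypothetical):

```shell
# Toy FASTA with a header description and a blank line
printf '>chr1 assembly v1\nACGT\n\nACGT\n' > genome.fa

# In-place variant (current behaviour): sed -i would rewrite genome.fa itself.
# Non-destructive variant: write the cleaned sequence to a new file instead.
sed '/>/ s/ .*//g; /^$/d' genome.fa > genome.cleaned.fa

head -1 genome.fa          # original header untouched: >chr1 assembly v1
head -1 genome.cleaned.fa  # cleaned header: >chr1
```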

TobyBaril commented 9 months ago

Hi, thanks for checking out Earl Grey!

Indeed, one copy of the input genome is modified for compatibility with the downstream tools. However, the lines above show exactly what is happening: a copy of the original genome is made, compressed, and saved with the .bak.gz extension. Essentially, we then work on a copy of this file without deleting the original, which, if I understand correctly, is the behaviour you are suggesting.

If this backup version is uncompressed using gunzip, the original genome still exists. The reason for this is to create an unmodified backup to prevent data loss, but to compress this so it takes less storage space. Nothing is deleted by the pipeline, including intermediate files, so that users have access to any data they may need. We leave the deletion of files up to the discretion of the user.
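As a sketch of that recovery path (the toy genome.fa filename is hypothetical), the unmodified original can be restored from the compressed backup at any time:

```shell
# Simulate the backup and in-place cleaning steps on a toy genome
printf '>chr1 assembly v1\nACGT\n' > genome.fa
cp genome.fa genome.fa.bak && gzip -f genome.fa.bak   # backup -> genome.fa.bak.gz
sed -i '/>/ s/ .*//g; /^$/d' genome.fa                # header descriptions stripped

# Recover the unmodified original from the backup
gunzip -c genome.fa.bak.gz > genome.fa
head -1 genome.fa   # back to: >chr1 assembly v1
```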

This is also made clear in point 1 in the Implementation section of the manuscript: https://www.biorxiv.org/content/10.1101/2022.06.30.498289v2.full

estolle commented 9 months ago

I understand that the original is kept as .bak.gz. What I meant is that the original itself is then modified in place, like this:

        sed -i '/>/ s/ .*//g; /^$/d' ${genome}

Hence, after earlGrey finishes, the original INPUT.fa has been modified.

I think this can lead to problems, and it could easily be avoided by having earlGrey work on a temporary copy, which can be gzipped for storage later.

I modified my earlGrey like this:

# Subprocess PrepGenome #
prepGenome()
{
    if [ ! -L ${genome} ]; then
        genome=$(realpath ${genome})
    fi
    if [ ! -f ${genome}.prep ] || [ ! -f ${genome}.dict ]; then
        cp ${genome} ${genome}.bak && gzip -f ${genome}.bak
        # original: sed -i '/>/ s/ .*//g; /^$/d' ${genome}
        # added: clean a temporary copy instead of modifying the input in place
        sed '/>/ s/ .*//g; /^$/d' ${genome} > ${genome}.earlGrey.tmp.fa
        # original: ${SCRIPT_DIR}/headSwap.sh -i ${genome} -o ${genome}.prep
        # added: run headSwap on the temporary copy
        ${SCRIPT_DIR}/headSwap.sh -i ${genome}.earlGrey.tmp.fa -o ${genome}.prep
        # added: compress the temporary copy only after headSwap has read it
        pigz -f ${genome}.earlGrey.tmp.fa
        sed -i '/^>/! s/[DVHBPE]/N/g' ${genome}.prep
        dict=${genome}.dict
        genome=${genome}.prep
    else
        dict=${genome}.dict
        genome=${genome}.prep
    fi
}
TobyBaril commented 9 months ago

Patched in v4.0.4

estolle commented 9 months ago

thank you very much!

estolle commented 9 months ago

The fix caused a small problem down the line when R is invoked and the ${genome}.dict file is not found. Since that file is generated from ${genome}.tmp, the created file is ${genome}.tmp.dict, but line 69 in earlGrey expects dict=${genome}.dict. I did this to fix it:

        mv ${genome}.tmp.dict ${genome}.dict
        dict=${genome}.dict

Another quick question about the RepeatCraft step: it says a mapfile is missing, but it continues anyway. Where does that file come from?

   <<< Running RepeatCraft >>>

    Step 1: Reformating GFF...
    Parsing LTR_FINDER GFF...
    Step 2: Labelling short TEs...
    Missing mapfile, use unite size for all TEs except simple repeat, low complexity and satellite.
    Step 3: Labelling LTR groups...

TobyBaril commented 9 months ago

Oops! Patched in 4.0.5!