Closed estolle closed 9 months ago
Hi, Thanks for checking out earl grey!
Indeed one copy of the input genome is modified for compatibility with the downstream tools, however the line above shows exactly what is happening - a copy of the original genome is made and compressed, saved with the .bak.gz
extension. essentially we then work on a copy of this file, without deleting the original, which if I understand is the behaviour you are suggesting.
If this backup version is uncompressed using gunzip
, the original genome still exists. The reason for this is to create an unmodified backup to prevent data loss, but to compress this so it takes less storage space. Nothing is deleted by the pipeline, including intermediate files, so that users have access to any data they may need. We leave the deletion of files up to the discretion of the user.
This is also made clear in point 1 in the Implementation section of the manuscript: https://www.biorxiv.org/content/10.1101/2022.06.30.498289v2.full
I understand that the original is kept as bak.gz
I meant the that then the original is modified like this (sed -i) : sed -i '/>/ s/ .*//g; /^$/d' ${genome}
Hence after earlGrey finished, the previous INPUT.fa is modified.
I think this can lead to problems and could easily avoided with a temporal copy earlGrey is working on and gzipping for storage later.
I modified my earlGrey like this:
# Subprocess PrepGenome #
prepGenome()
{
if [ ! -L ${genome} ]; then
genome=$(realpath ${genome})
fi
if [ ! -f ${genome}.prep ] || [ ! -f ${genome}.dict ]; then
cp ${genome} ${genome}.bak && gzip -f ${genome}.bak
# original # sed -i '/>/ s/ .*//g; /^$/d' ${genome}
# added
sed '/>/ s/ .*//g; /^$/d' ${genome} > ${genome}.earlGrey.tmp.fa
pigz -f ${genome}.earlGrey.tmp.fa
#original # ${SCRIPT_DIR}/headSwap.sh -i ${genome} -o ${genome}.prep
#added
${SCRIPT_DIR}/headSwap.sh -i ${genome}.earlGrey.tmp.fa -o ${genome}.prep
sed -i '/^>/! s/[DVHBPE]/N/g' ${genome}.prep
dict=${genome}.dict
genome=${genome}.prep
else
dict=${genome}.dict
genome=${genome}.prep
fi
}
Patched in v4.0.4
thank you very much!
The fix caused a small problm down the line when R is invokes and the ${genome}.dict file is not found. Since that file is generated from ${genome}.tmp, the created file is dict=${genome}.tmp.dict But then in Line 69 in earlGrey: dict=${genome}.dict I did this to fix it: mv ${genome}.tmp.dict ${genome}.dict dict=${genome}.dict
another quick Q for the RepeatCraft step: it say its missing a mapfile but it continues anyways. Where is such a file coming from?
<<< Running RepeatCraft >>>
Step 1: Reformating GFF... Parsing LTR_FINDER GFF... Step 2: Labelling short TEs... Missing mapfile, use unite size for all TEs except simple repeat, low complexity and satellite. Step 3: Labelling LTR groups...
Oops! Patched in 4.0.5!
Hi there,
I just found these lines in the early stages of earlGrey where it preppes the input genome
I was not aware that the original file is being modified. I think this behaviour is dangerous and as the user doesnt really notice this. I would also want to avoid to make too many copies of the input, hence I think modifying this behavior would be better, ie. modify a copy and then do the next steps until the genome.prep