Open dg520 opened 4 months ago
Hi @MG-DYM
Sorry for the late reply. Judging from the error messages, there are multiple issues.
The failure you showed in Phase 2 happened because lzop failed to decompress files. To fix this, first make sure you're using the latest IRFinder, v1.3.1. Then I'd recommend commenting out lines 21-24 in the file bin/util/Mapability and re-running. This ensures that the compression algorithm is gzip instead of lzop. To be specific, lines 21-24 are the following lines and should be commented out:
if [ -x /usr/bin/lzop ]; then
TMPCMP=/usr/bin/lzop
TMPEXT=lzo
fi
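If you'd rather not edit the file by hand, the same change can be sketched with sed. This assumes that lines 21-24 of your copy really are the four lines above (check with the first usage command, since line numbers can shift between versions). The function name comment_out_lines is a helper made up for this example and is not part of IRFinder.

```shell
# Hypothetical helper: prefix lines START..END of FILE with "#".
# A backup of the original file is kept as FILE.bak.
comment_out_lines() {
    local file=$1 start=$2 end=$3
    sed -i.bak "${start},${end}s/^/#/" "$file"
}

# Usage (run from the IRFinder install directory):
#   sed -n '21,24p' bin/util/Mapability        # confirm these are the lzop lines
#   comment_out_lines bin/util/Mapability 21 24
```

To undo the change, copy the .bak file back over the edited one.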
In the screenshot, the error message indicated that the system couldn't locate a virtual file, /dev/fd/62. That virtual file "packages" an input stream (e.g., a string) into a file-like structure so that a program requiring a file as its input knows how to deal with it. This is standard on all Linux systems. While figuring out exactly what happens on your system is a bit hard, the question is whether you ran the command in root mode, either by starting the command with sudo or by logging in as the root user. Please DO NOT run IRFinder in root mode. Otherwise, you will end up with the "/dev/fd/62 not found" error.
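For illustration, here is a small bash sketch of what such a virtual file looks like and how to check whether you are root. It only demonstrates the general mechanism (bash process substitution); it is not taken from the IRFinder scripts, and the exact descriptor number printed will vary.

```shell
# Process substitution: bash replaces <(...) with a path under /dev/fd
# that a command can open as if it were a regular file.
echo <(true)

# A program that expects a file name can read the "packaged" stream:
cat <(printf 'hello\n')

# The effective user ID is 0 when running as root (including via sudo).
if [ "$(id -u)" -eq 0 ]; then
    echo "Running as root: switch to a normal user before running IRFinder."
fi
```

If the first two commands fail on your system even as a normal user, that would point to a problem with /dev/fd itself rather than with IRFinder.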
I haven't worked with the T2T genome. But you mentioned that its chromosome names look like 'NC_060925.1'. That is totally fine. You don't need to rename them, as long as the chromosome nomenclature is consistent between the GTF and the FASTA. IRFinder will take care of it.
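A quick way to check that consistency is sketched below. It assumes an uncompressed FASTA and GTF and bash (for process substitution); the function name and the file names in the usage line are placeholders, not anything IRFinder provides.

```shell
# Print sequence names that the GTF uses but the FASTA does not define.
# No output means every GTF chromosome name is present in the FASTA.
gtf_names_missing_from_fasta() {
    local fasta=$1 gtf=$2
    # FASTA names: header lines, ">" stripped, first word only.
    # GTF names: first column of every non-comment line.
    comm -13 \
        <(grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u) \
        <(grep -v '^#' "$gtf" | cut -f1 | sort -u)
}

# Usage (placeholder file names):
#   gtf_names_missing_from_fasta genome.fa transcripts.gtf
```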
Please let me know if any of the above solutions fixes your issue.
Thanks for your advice, it's very nice of you. But I still have the problem after commenting out lines 21-24 in the file bin/util/Mapability: maybe something is wrong with my genome.fa and GTF. I will use hg38 to build the reference to check it.
@MG-DYM Now this is a new and much clearer error message that helps us understand what is going on.
I believe the true error is in the second-to-last step, namely "filtering aligned genome fragments by chromosome/scaffold".
In this step, IRFinder extracts 100% mapped reads from a BAM file that was successfully generated in the previous step and saves them into compressed BED files. IRFinder does this chromosome by chromosome, so each output .bed.gz file is named after a chromosome name listed in ${IRFinder_REF}/STAR/chrName.txt, which is generated in Phase 1.
Then, in the step where you encountered the error, IRFinder carries out a trivial merge of those .bed.gz files. But because one of the .bed.gz files ends unexpectedly (i.e. the file is incomplete), it raises the error message you saw.
We need to figure out why the files end up incomplete:
Make sure there is enough storage space, both on the disk and for your user account.
How many chromosomes are there in the T2T FASTA? You can check in ${IRFinder_REF}/STAR/chrName.txt. A Linux system has a) a limit on how many files can be open and written to disk at the same time; and b) a limit on the maximum number of files a single folder can contain. If there are too many chromosomes, IRFinder might break these rules, as it tries to take advantage of parallelism and open all chromosome output files at once.
To ensure we won't break Rule a) above, we can turn off parallelism by hard-coding $THREADS to 1 in Line 51 of bin/util/Mapability. Save and re-run. Note that this won't solve insufficient disk space or Rule b) issues.
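The checks above can be sketched in a few commands. This assumes bash and standard coreutils; count_chromosomes is a helper name invented for this example, and the path in the usage line assumes ${IRFinder_REF} points at your reference folder.

```shell
# Count the sequences listed in a STAR chromosome-name file (one name per line).
count_chromosomes() {
    wc -l < "$1" | tr -d ' '
}

# Usage:
#   count_chromosomes "${IRFinder_REF}/STAR/chrName.txt"

# Per-process limit on simultaneously open files (relevant to Rule a).
ulimit -n

# Free space on the file system holding the current directory.
df -h .
```

If the chromosome count is close to or above the number printed by ulimit -n, Rule a) becomes a plausible suspect.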
To help you debug further, I would also suggest commenting out Lines 78 to 82 (i.e. the final five lines) in bin/util/Mapability, then saving and re-running. This will keep the temp files, such as those .bed.gz files, so we can further narrow down where the failure occurs.
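Once the temp files are kept, one way to spot an incomplete file is gzip's built-in integrity test. The sketch below assumes the temp files are gzip-compressed (i.e. the lzop lines are commented out); find_broken_gz is a made-up helper name, and the folder name in the usage line is a placeholder for whatever temp folder the run leaves behind.

```shell
# Print every .gz file in DIR that fails gzip's integrity test
# (for example, because it was truncated).
find_broken_gz() {
    local dir=$1 f
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue              # the glob matched nothing
        gzip -t "$f" 2>/dev/null || echo "$f"
    done
}

# Usage (placeholder folder name):
#   find_broken_gz tmp_12345
```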
It seems that it produced an empty genome_fragments.unsorted.bed. You are right, the true error is in the second-to-last step. And the temp folder is empty too, even though I commented out Lines 78 to 82 (i.e. the final five lines) in bin/util/Mapability. So maybe something is wrong with my software or software environment? The error where the system couldn't locate the virtual file /dev/fd/62 happens on another system. That computer didn't have the error in the second-to-last step. Best wishes to you!
@MG-DYM Thank you for the updates!
Further questions:
Did you edit bin/util/Mapability, as I suggested previously, to turn off multithreading? If so, can you also try to run the command xargs --version and tell me what is printed on the screen?
[...] tmp_${RANDOM} folder?
I found the problem. The error happens at Phase 2; the SAM file's chromosome name got a wrong prefix: So it could not make a correct index for the BAM. So maybe something is wrong with STAR? I have used different versions, but I got the same problem. Waiting for your response. Best wishes!
@MG-DYM I don't see a problem with the chromosome names. The 1st column of a BAM/SAM file holds read names; in this case, RF!chr1 and RR!chr1 are dummy read names made by IRFinder Phase 2. The 2nd column is the SAM flag and looks good. The 3rd column is the chromosome name; in your screenshot, chr1 looks totally normal to me.
Here are my questions:
Is the file in your screenshot genome_fragments.sam or genome_fragments.bam? If the former, can you also check whether genome_fragments.bam exists and is intact, meaning it should have the same number of lines as the SAM? And also, does genome_fragments.bam.bai exist?
Please go to the Mapability folder and manually run the following:
# Make sure you are in the "Mapability" folder
TMPCMP=gzip
TMPEXT=gz
TMPBED=tmp_test
# Create the output folder; the redirections below fail if it does not exist
mkdir -p "$TMPBED"
# Run one samtools | awk pipeline per chromosome, one at a time (--max-procs 1)
cat ../STAR/chrName.txt | \
xargs --max-args 1 --max-procs 1 -I{} bash -c \
" \
samtools view genome_fragments.bam {} | \
awk -v tmpdir=\"$TMPBED\" -v tmpcmp=\"$TMPCMP\" -v tmpext=\"$TMPEXT\" \
' \
BEGIN{FS=\"[\\t!]\"; OFS=\"\\t\"} \
{ \
if ((\$8 == \"70M\") && (\$3 == \$6) && (\$2 == \$5)) \
{print \$5, \$6-1, \$6+69 | (tmpcmp \" -c1 > \" tmpdir \"/\" \$5 \".bed.\" tmpext ) } \
} \
END{close( (tmpcmp \" -c1 > \" tmpdir \"/\" \$5 \".bed.\" tmpext ))} \
' \
"
Does this generate any errors? If not, it will write many files into the tmp_test folder. Check how many files are there. The number should match the number of lines in ../STAR/chrName.txt. Let me know how this goes.
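To see what the awk filter in that command keeps, here is a stripped-down sketch run on fabricated SAM-like records instead of a real BAM. The read-name layout RF!chr1!101 (direction, chromosome, position, separated by "!") is an assumption inferred from the field separator and the comparisons in the command; the records themselves are invented, and filter_fragments is just a name for this example.

```shell
# Keep only fragments that align with CIGAR 70M back to the chromosome and
# position encoded in their own read name; print them as BED intervals.
filter_fragments() {
    awk 'BEGIN{FS="[\t!]"; OFS="\t"}
         ($8 == "70M") && ($3 == $6) && ($2 == $5) {print $5, $6-1, $6+69}'
}

# Fabricated records: read name, flag, chromosome, position, MAPQ, CIGAR.
{
    printf 'RF!chr1!101\t0\tchr1\t101\t255\t70M\n'       # kept
    printf 'RF!chr1!201\t0\tchr2\t201\t255\t70M\n'       # dropped: other chromosome
    printf 'RR!chr1!301\t16\tchr1\t301\t255\t60M10S\n'   # dropped: not 70M
} | filter_fragments
```

Only the first record passes, giving the interval chr1, 100, 170. A chromosome with no passing records would produce no output for this filter.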
Hi Sir, I successfully ran IRFinder and got the result. Is this result OK? Do you have any tips about the result? Best wishes!
@MG-DYM Looks good to me. The last line indicates your paired-end reads have a reverse-forward directionality for R1 and R2, indicating your RNASeq is likely from Illumina.
Thanks for your advice. It's very nice of you!
Reposted from the original question of @MG-DYM
Hi Sir, I want to use the T2T genome and its transcripts.gtf to build the reference. It's going wrong at <Phase 2: Mapability Calculation>:
<Phase 1: STAR Reference Preparation>
Jul 04 21:37:17 ..... started STAR run
Jul 04 21:37:17 ... starting to generate Genome files
Jul 04 21:38:30 ... starting to sort Suffix Array. This may take a long time...
Jul 04 21:38:46 ... sorting Suffix Array chunks and saving them to disk...
Jul 04 21:51:05 ... loading chunks from disk, packing SA...
Jul 04 21:52:30 ... finished generating suffix array
Jul 04 21:52:30 ... generating Suffix Array index
Jul 04 21:56:23 ... completed Suffix Array index
Jul 04 21:56:23 ..... processing annotations GTF
Jul 04 21:56:58 ..... inserting junctions into the genome indices
Jul 04 21:59:54 ... writing Genome to disk ...
Jul 04 22:00:24 ... writing Suffix Array to disk ...
Jul 04 22:04:27 ... writing SAindex to disk
Jul 04 22:04:41 ..... finished successfully
<Phase 2: Mapability Calculation>
Jul 04 22:04:41 ... mapping genome fragments back to genome...
Jul 04 22:21:57 ... sorting aligned genome fragments...
[bam_sort_core] merging from 36 files and 36 in-memory blocks...
Jul 04 22:29:45 ... indexing aligned genome fragments...
Jul 04 22:30:24 ... filtering aligned genome fragments by chromosome/scaffold...
Jul 04 22:31:05 ... merging filtered genome fragments...
lzop: : not a lzop file
Mapability build: Failed!
I checked the software version and the chromosome names in the FASTA and the GTF; both are OK. Is there something wrong with the T2T genome's transcripts.gtf? I first used the original T2T genome FASTA and its transcripts.gtf, whose chromosome names are like 'NC_060925.1'; it also has a problem: Could you help me find out the problem? Thanks for your help. Best wishes!