Problems when building the Reference(T2T-genome)

dg520 commented 4 months ago

Reposted from the original question of @MG-DYM

Hi Sir,I want to use T2T-genome and its transcripts.gtf to builde the Reference.It's going wrong at <Phase 2: Mapability Calculation>: <Phase 1: STAR Reference Preparation> Jul 04 21:37:17 ..... started STAR run Jul 04 21:37:17 ... starting to generate Genome files Jul 04 21:38:30 ... starting to sort Suffix Array. This may take a long time... Jul 04 21:38:46 ... sorting Suffix Array chunks and saving them to disk... Jul 04 21:51:05 ... loading chunks from disk, packing SA... Jul 04 21:52:30 ... finished generating suffix array Jul 04 21:52:30 ... generating Suffix Array index Jul 04 21:56:23 ... completed Suffix Array index Jul 04 21:56:23 ..... processing annotations GTF Jul 04 21:56:58 ..... inserting junctions into the genome indices Jul 04 21:59:54 ... writing Genome to disk ... Jul 04 22:00:24 ... writing Suffix Array to disk ... Jul 04 22:04:27 ... writing SAindex to disk Jul 04 22:04:41 ..... finished successfully <Phase 2: Mapability Calculation> Jul 04 22:04:41 ... mapping genome fragments back to genome... Jul 04 22:21:57 ... sorting aligned genome fragments... [bam_sort_core] merging from 36 files and 36 in-memory blocks... Jul 04 22:29:45 ... indexing aligned genome fragments... Jul 04 22:30:24 ... filtering aligned genome fragments by chromosome/scaffold... Jul 04 22:31:05 ... merging filtered genome fragments... lzop: : not a lzop file Mapability build: Failed!

I checked the software version and FAST's and GTF's chromosome name,both are ok.Is there something wrong with T2T-geonme's transcripts.gtf?I first used the origin T2T-geonme fasta and its transcripts.gtf,whose chromosome name is like 'NC_060925.1',it also has a problem: Could you help me find out the problem? Thanks for your help,Best wishes!

dg520 commented 4 months ago

Hi @MG-DYM

Sorry for coming late. From the error messages, there are multiple issues.

In the first half where you showed me the failure in Phase 2, this was due to that lzop failed to decompress files. To fix this, first make sure you're using the latest IRFinder v1.3.1. Then I'd recommend you comment out lines 21-24 in the file bin/until/Mapability and re-run. This will ensure the compression algorithm to be gzip instead of lzop. To be specific, lines 21-24 refer to the following lines and should be commented out:

if [ -x /usr/bin/lzop ]; then                                                                                                                                                                                  
TMPCMP=/usr/bin/lzop                                                                                                                                                                                         
TMPEXT=lzo
fi

In the screenshot, the error message indicated that the system couldn't locate a virtual file /dev/fd/62. That virtual file "packages" a standard input (e,g,, a string) into a file-like structure so that a program requiring a file as its input knows how to deal with the situation. This is standard in all Linux systems. While figuring out what exactly happens on your system is a bit hard, the question is whether you run the command in root mode, either starting the command with sudo or logging in as a root user. Please DO NOT run IRFinder in root mode. Otherwise, "/dev/fd/62 not found" will be the destination.
I haven't worked on T2T genome. But you mentioned it has chromosome name is like 'NC_060925.1'. That is totally fine. You don't need to rename it, as long as the chromosome nomenclature is consistent between GTF and FASTA. IRFinder will take care of it.

Please let me know if any of the above solution fix your issue.

MG-DYM commented 4 months ago

Thanks for you advice,It's very nice of you.But I still have the problem when I comment out lines 21-24 in the file bin/until/Mapability: 微信图片_20240718172441 maybe something wrong with my genome.fa and GTF,I will use hg38 to build the reference to check it.

dg520 commented 4 months ago

@MG-DYM Now this is a new yet much clearer error message to help us understand what is going on. I believe the true error is in the second last step, namely filtering aligned genome fragments by chromosome/scaffold.

In this step, IRFinder extracts 100% mapped reads from a bam file that is successfully generated in the previous step and saves them into compressed BED files. IRFinder does it chromosome by chromosome, so that each of the output .bed.gz file is named after chromosome names according to ${IRFinder_REF}/STAR/chrNames.txt generated in Phase I.

Then, in the step you encountered error, IRFinder carries out a trivial merge of those .bed.gz files. But because one of the .bed.gz ends unexpectedly (i.e. the file is incomplete), it raises to the error message you saw.

We need to figure out why file incompletion happens:

Make sure there is enough storage space both on the disk and for your username.
How many chromosomes are there in the T2T FASTA? You can check it in ${IRFinder_REF}/STAR/chrNames.txt. Linux system has a) a limitation of how many files can be opened and written to the disk at the same time; and b) a limitation of the maximal number of files can be contained in a single folder. If there are too many chromosomes, IRFinder might break these rules, as it tries to take advantage of parallel and open all chromsome writing at once.
To ensure we won't break Rule a) above, we can turn off parallel, by changing and hard-coding $THREADS to 1 in Line 51 of bin/util/Mapability. Save and re-run. Note, this won't solve insufficient disk or Rule b) issues.
To help your debug further, I would also suggest to commenting out Lines 78 to 82 (i.e. the final five lines) in bin/util/Mapability and save and re-run. This will keep the temp files such as those .bed.gz files. We can further nail down where the failure occurs.

MG-DYM commented 3 months ago

It seems that it produced a empty genome_fragments.unsorted.bed.You are right,the true error is in the second last step.And the temp folder is emtpy too though I comment out Lines 78 to 82 (i.e. the final five lines) in bin/util/Mapability. So maybe something wrong with my software or software environment？The error that the system couldn't locate a virtual file /dev/fd/62 happens on another system.That computer didn't have the error in the second last step.Best wishes to you!

dg520 commented 3 months ago

@MG-DYM Thank you for the updates!
Further questions:

On your original machine, did you run it with changes of Line 51 in bin/util/Mapability, as I suggested previously, to turn off multithreading? If so, can you also try to run the command xargs --version and tell me what is printed on the screen?
On the other machine that you didn't encounter the Phase 2 , did it generate bed.gz files in the tmp_${RANDOM} folder?

MG-DYM commented 3 months ago

I got the problem.The error happens at Phase 2,the sam file's chromosome name got a wrong prefix: So that It could not make a right index for the bam.So may be something wrong with STAR ?I have used different versions,but I got the same problem.Wait for you response,Best wishes!

dg520 commented 3 months ago

@MG-DYM I didn't see the problem of chromosomes' names. The 1st column of BAM/SAM refers to read names, in this case, RF!chr1 and RR!chr1 are dummy read names made by IRFinder Phase 2. The 2nd column is SAM Flag and looks good. The 3rd column is chromosome name, in your screenshot, chr1 looks totally normal to me.

Here are my questions:

Is this screenshot from genome_fragments.sam or genome_fragments.bam? If the former, can you also check whether genome_fragments.bam exists and intact, meaning it should have the same number of lines as SAM? And also, does genome_fragments.bam.bai exist?

If both BAM and its index are there and fine, the problem is almost certain in the next step. Can you go to the Mapability folder and manually run the following:

# Make sure you are in the "Mapability" folder
TMPCMP=gzip
TMPEXT=gz
TMPBED=tmp_test
cat ../STAR/chrName.txt | \
xargs --max-args 1 --max-procs 1 -I{} bash -c \
" \
samtools view genome_fragments.bam {} | \
  awk -v tmpdir=\"$TMPBED\" -v tmpcmp=\"$TMPCMP\" -v tmpext=\"$TMPEXT\" \
  ' \
    BEGIN{FS=\"[\\t!]\"; OFS=\"\\t\"} \
    { \
      if ((\$8 == \"70M\") && (\$3 == \$6) && (\$2 == \$5)) \
        {print \$5, \$6-1, \$6+69 | \(tmpcmp \" -c1 > \" tmpdir \"/\" \$5 \".bed.\" tmpext ) } \
    } \ 
   END{close( (tmpcmp \" -c1 > \" tmpdir \"/\" \$5 \".bed.\" tmpext ))} \
  ' \
"

Does this generate any error? If not, this will write many files into the tmp_test folder. Check how many files there. The number should match the number of lines in ../STAR/chrName.txt. Let me know how this goes.

MG-DYM commented 6 days ago

Hi Sir,I successfully run IRFinder and got the result.Is this result OK?Do you have some tips about the result? 7445996cafd17e7613e3ae3c66624fd Best wishes!

dg520 commented 6 days ago

@MG-DYM Looks good to me. The last line indicates your paired-end reads have a reverse-forward directionality for R1 and R2, indicating your RNASeq is likely from Illumina.

MG-DYM commented 5 days ago

Thanks for your advice.It's very nice of you!

dgaolab / IRFinder

Problems when building the Reference(T2T-genome) #1