RitchieLabIGH / IRFinder

MIT License
13 stars 10 forks source link

builiding reference posting to both IRFinder githubs #24

Open pillailab opened 1 year ago

pillailab commented 1 year ago

Hello, I have been trying to build a new reference for hg19 (my bam files are hg19) and i have tried the STAR variation, singularity install, older 1.3 version - all of it just runs for several days and times out. Can you let me know what may help ? Down is the latest iteration Launching reference build process. The full build might take hours. <Phase 1: STAR Reference Preparation> STAR --runMode genomeGenerate --genomeDir /home/mp758/project/GenomeBuilds/IRfinder/IRFinder_hg19/STAR --genomeFastaFiles /home/mp758/project/GenomeBuilds/IRfinder/IRFinder_hg19/genome.fa --sjdbGTFfile /home/mp758/project/GenomeBuilds/IRfinder/IRFinder_hg19/transcripts.gtf --sjdbOverhang 150 --runThreadN 20 &>> log-star-build-ref.log STAR version: 2.7.10a compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source May 17 15:15:07 ..... started STAR run May 17 15:15:07 ... starting to generate Genome files May 17 15:15:47 ..... processing annotations GTF May 17 15:16:14 ... starting to sort Suffix Array. This may take a long time... May 17 15:16:31 ... sorting Suffix Array chunks and saving them to disk... May 17 15:34:31 ... loading chunks from disk, packing SA... May 17 15:35:45 ... finished generating suffix array May 17 15:35:45 ... generating Suffix Array index May 17 15:39:31 ... completed Suffix Array index May 17 15:39:32 ..... inserting junctions into the genome indices May 17 15:43:32 ... writing Genome to disk ... May 17 15:43:35 ... writing Suffix Array to disk ... May 17 15:43:56 ... writing SAindex to disk May 17 15:43:59 ..... finished successfully <Phase 2: Mapability Calculation> May 17 15:44:00 ... mapping genome fragments back to genome... slurm-42586915.out (END)

Thanks Manoj

CloXD commented 1 year ago

Hello Manoj, Sorry to hear about this inconvenience. The problem is related to the generation of the mappability mask. To replicate your issue, I would need to know which reference genome and annotation format and version you are using ( GENCODE or ensemble ). Cheers, Claudio

pillailab commented 1 year ago

Hi Claudio, I am using hg19 - the files are gencode.v19.annotation.gtf and GRCh37.p13.genome.fa If I can generate it from the esmembl site I am happy to do that - not sure which gtf files to use ? thanks manoj

CloXD commented 1 year ago

Ok, thanks, I'll try to reproduce the error and get back to you asap. Cheers, Claudio

CloXD commented 1 year ago

Hello Manoj, I found the problem: the reference genome of GENCODE GRCh37 has all the sequences of each chromosome in one line, and this causes a problem in one of the Perl scripts. The GRCh38 of GENCODE and the ENSEMBL ones divide the sequences in 60 bp for each line. To solve the issue, you can use the following snippet that emulates what the command "IRFinder BuildRefDownload" does, removing the unconvential chromosomes, filtering the gtf and, in addition, this snippet divides the nucleotide sequence in lines of 60 bp each.

reference_gtf=$(realpath gencode.v19.annotation.gtf )
reference_genome=$(realpath GRCh37.p13.genome.fa )
IRF_REF=${OUTDIR}/IRFinderRef/
mkdir -p $IRF_REF

cat $reference_genome | awk -v basedir="${IRF_REF}" 'BEGIN {FS=" "; OFS=" "; OUTPUT=0}
  ($0 ~ /^>/ ) {OUTPUT=0; print >> basedir"/chrs.all.list"}
  ($0 ~ /^>chr[0-9XYM]+/  ) {OUTPUT=1; print substr($1,2)}
  (OUTPUT==1) { if ( $0 ~ /^>/ ) { print > basedir"/genome.fa" } else { for ( i=1; i <= length($0) ; i+=60 ) { print substr($0, i, 60) > basedir"/genome.fa" }  }   }' > ${IRF_REF}/chrs.selected.list

awk -v basedir="${IRF_REF}" 'BEGIN { FS="\t" } NR==FNR { arr[$0]=1 } NR > FNR && ( arr[$1] || $0 ~ /^#/ ) { print } ' ${IRF_REF}/chrs.selected.list  $reference_gtf  > ${IRF_REF}/transcripts.gtf 

singularity exec -e ${IR_IMG} IRFinder \
        BuildRefProcess  \
          -t ${THREADS} -n 100 \
          -r ${IRF_REF} 

I tested only the reference preparation, and it works smoothly. Cheers, Claudio