ExpressionAnalysis / STAR-SEQR

RNA Fusion Detection and Quantification

[STAR-SEQR; Docker] ERROR - Exception: ('too many values to unpack', u'occurred at index **') #31

Open ReWeda opened 1 year ago

ReWeda commented 1 year ago

System Info:

STAR-SEQR version: 0.6.7
Docker version (server/client): 20.10.12
OS: Ubuntu 20.04.5 LTS x86_64

I'm running star-seqr via the docker image: eagenomics/starseqr (Image ID: 0e715fd07246)

First, I generated the STAR reference genome index by running STAR inside the Docker image (STAR_2.5.3a_modified). No problems arose.

Then I ran star-seqr, providing the fastq.gz files from one of my samples as input, as follows:

docker run -v $storage_dir:/data \
                -v $reference_dir:/reference \
                -v $temporary_dir:/output \
                eagenomics/starseqr starseqr.py \
                -1 /data/$id'_R1.fastq.gz' \
                -2 /data/$id'_R2.fastq.gz' \
                -p /output/starseqr/$id \
                -i /reference/fusions/Homo_sapiens_assembly38-Index_STARSEQR_ReadLength50 \
                -g /reference/gencode.v27.primary_assembly.annotation.gtf \
                -r /reference/Homo_sapiens_assembly38.fasta \
                -t 16 -m 1 -vv

I have already verified that the provided paths are correct once the variables in use are resolved. The STAR-SEQR run fails when applying the function "apply_jxn_strand".

Below is the content of the log file produced by the run, to provide more detail about the error:

2023-01-30 10:43 - INFO -

2023-01-30 10:43 - INFO - ################################################################################
2023-01-30 10:43 - INFO - #                             0 01/30/23  10:43:06                             #
2023-01-30 10:43 - INFO - ################################################################################
2023-01-30 10:43:06 - starseqr - INFO - ***************STAR-SEQR******************
2023-01-30 10:43:06 - starseqr - INFO - CMD = /opt/conda/bin/starseqr.py -1 /data/606_R1.fastq.gz -2 /data/606_R2.fastq.gz -p /output/starseqr/606
 -i /reference/fusions/Homo_sapiens_assembly38-Index_STARSEQR_ReadLength50 -g /reference/gencode.v27.primary_assembly.annotation.gtf -r /reference
/Homo_sapiens_assembly38.fasta -t 16 -m 1 -vv
2023-01-30 10:43:06 - starseqr - INFO - STAR-SEQR_version = 0.6.7
2023-01-30 10:43:06 - starseqr - INFO - Starting to work on sample: /output/starseqr/606
2023-01-30 10:43:06 - starseqr - INFO - Found input: /data/606_R1.fastq.gz
2023-01-30 10:43:06 - starseqr - INFO - Found input: /data/606_R2.fastq.gz
2023-01-30 10:43:06 - starseqr - INFO - Found input: /reference/Homo_sapiens_assembly38.fasta
2023-01-30 10:43:06 - starseqr - INFO - Found input: /reference/gencode.v27.primary_assembly.annotation.gtf
2023-01-30 10:43:06 - star_funcs - INFO - Starting STAR Alignment
2023-01-30 10:43:06 - star_funcs - INFO - *STAR Command: STAR --readFilesIn /data/606_R1.fastq.gz /data/606_R2.fastq.gz --readFilesCommand zcat --
runThreadN 16 --genomeDir /reference/fusions/Homo_sapiens_assembly38-Index_STARSEQR_ReadLength50 --outFileNamePrefix  /output/starseqr/606_STAR-SE
QR/606. --chimScoreJunctionNonGTAG -1 --outSAMtype None --chimOutType SeparateSAMold --alignSJDBoverhangMin 5 --outFilterMultimapScoreRange 1 --ou
tFilterMultimapNmax 5 --outMultimapperOrder Random --outSAMattributes NH HI AS nM ch --chimSegmentMin 10 --chimJunctionOverhangMin 10 --chimScoreM
in 1 --chimScoreDropMax 30 --chimScoreSeparation 7 --chimSegmentReadGapMax 3 --chimFilter None --twopassMode None --alignSJstitchMismatchNmax 5 -1
 5 5 --chimMainSegmentMultNmax 10
2023-01-30 10:46:43 - star_funcs - INFO - Jan 30 10:43:06 ..... started STAR run
Jan 30 10:43:06 ..... loading genome
Jan 30 10:43:43 ..... started mapping
Jan 30 10:46:43 ..... finished successfully

2023-01-30 10:46:43 - star_funcs - INFO - STAR Alignment Finished!
2023-01-30 10:46:43 - core - INFO - Importing junctions
2023-01-30 10:46:44 - core - INFO - Number of candidates removed due to Mitochondria filter: 1500
2023-01-30 10:46:44 - core - INFO - Removing duplicate reads
2023-01-30 10:46:44 - common - INFO - Begin multiprocessing of function apply_cigar_overhang in a pool of 16 workers using map_async protocol
2023-01-30 10:46:44 - common - DEBUG - *The dataframe will be split evenly across the 16 workers
2023-01-30 10:46:44 - common - DEBUG - *Initializing a map_async pool with 16 workers
2023-01-30 10:46:45 - common - DEBUG - *Time to run pandas_parallel on apply_cigar_overhang took 0.464632 seconds
2023-01-30 10:47:05 - starseqr - INFO - Ordering junctions
2023-01-30 10:47:05 - starseqr - INFO - Normalizing junctions
2023-01-30 10:47:05 - common - INFO - Begin multiprocessing of function apply_normalize_jxns in a pool of 16 workers using map_async protocol
2023-01-30 10:47:05 - common - DEBUG - *The dataframe will be split evenly across the 16 workers
2023-01-30 10:47:05 - common - DEBUG - *Initializing a map_async pool with 16 workers
2023-01-30 10:47:05 - common - DEBUG - *Time to run pandas_parallel on apply_normalize_jxns took 0.559973 seconds
2023-01-30 10:47:05 - starseqr - INFO - Getting gene strand and flipping info as necessary
2023-01-30 10:47:05 - common - INFO - Begin multiprocessing of function apply_jxn_strand in a pool of 16 workers using map_async protocol
2023-01-30 10:47:05 - common - DEBUG - *The dataframe will be split evenly across the 16 workers
2023-01-30 10:47:05 - common - DEBUG - *Initializing a map_async pool with 16 workers
2023-01-30 10:47:19 - common - ERROR - Exception: ('too many values to unpack', u'occurred at index 1820')

Moreover, the last file produced in the output directory is *_STAR-SEQR_breakpoints.txt, and it contains only the header.

Any help with fixing this error would be appreciated.
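For context, "too many values to unpack" is the standard Python ValueError raised when a sequence is unpacked into fewer targets than it has elements; the "occurred at index 1820" suffix is added by pandas when the failure happens inside an apply over a DataFrame row. A minimal reproduction of the underlying Python error (a hypothetical sketch, not the actual STAR-SEQR code):

```python
def unpack_two(values):
    # Mimics unpacking a function result into exactly two targets;
    # this raises ValueError if `values` has more (or fewer) elements.
    first, second = values
    return first, second

# A three-element result triggers the same error text seen in the log.
try:
    unpack_two(("geneA", "+", "extra"))
    message = None
except ValueError as err:
    message = str(err)

print(message)  # contains "too many values to unpack"
```

This suggests that, for the row at the reported index, the value being unpacked inside apply_jxn_strand had an unexpected shape.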

Further info: I previously ran star-seqr successfully on simulated read data. Given the returned error, the only difference I can think of between the real and simulated read files is the naming schema of the reads themselves. I used a very basic naming schema for the simulated data (e.g., reads1000/1 and reads1000/2), while the real data uses the naming schema produced by Illumina sequencers. Could this difference be the cause of the problem, or is it not relevant?
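To check whether the naming schema could matter, one way is to normalize both styles to a mate-independent read name before comparing: old-style names carry a /1 or /2 suffix, while modern Illumina headers put the mate information after a space. A hypothetical helper for this comparison (base_read_name is my own name, not part of STAR-SEQR):

```python
def base_read_name(header):
    """Return the mate-independent read name from a FASTQ header line.

    Handles both old-style '/1' and '/2' suffixes and modern Illumina
    headers, where the mate information follows a space in a comment
    field (e.g. '1:N:0:1').
    """
    name = header.lstrip("@").split()[0]       # drop the Illumina comment field
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]                       # drop the old-style mate suffix
    return name

# Both styles reduce to a name shared by the two mates of a pair.
print(base_read_name("@reads1000/1"))
print(base_read_name("@M00123:45:000-ABCDE:1:1101:15589:1331 1:N:0:1"))
```

If the tool expects one style and receives the other, read pairing or downstream parsing could behave differently, so normalizing or spot-checking a few headers from both datasets might help isolate the issue.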