ZijieJin / scFusion


FusionCandidate error #9

Closed FredoJones closed 2 years ago

FredoJones commented 2 years ago

Hi, great tool! I am trying to set it up for a single-cell RNA-seq experiment from one sample. I have 5 pairs of FASTQ files, renamed according to your nomenclature: 1_1.fastq 1_2.fastq 2_1.fastq 2_2.fastq 3_1.fastq 3_2.fastq 4_1.fastq 4_2.fastq 5_1.fastq 5_2.fastq

The steps up to ReadProcessing work without errors. I load the tools partly through conda and partly from modules on my server. In particular, I load samtools from a module because there are issues installing it through conda. In the conda environment I keep:

- tensorflow 2.8.0 cpu_py39h4655687_0 conda-forge
- scipy 1.8.1 py39he49c0e8_0 conda-forge
- numpy 1.22.3 py39hc58783e_2 conda-forge
- star 2.7.10a h9ee0642_0 bioconda
- pysam 0.19.0 py39h5030a8b_0 bioconda
- pyensembl 2.0.0 pyh5e36f6f_0 bioconda
- keras 2.8.0 pyhd8ed1ab_0 conda-forge
- bedtools 2.30.0 h468198e_3 bioconda

I know this does not exactly match the package versions in your manual, but certain versions of some packages cannot be installed without updating others. Could these slight variations be responsible for the errors below?

During the genome indexing step I get these warnings, although the command still completes the task:

`/gpfs/home/projects/analisi_vdj/alfredo.marchetti/sc_SC001/utils/fusionenv/lib/python3.9/site-packages/gtfparse/read_gtf.py:82: FutureWarning: The error_bad_lines argument has been deprecated and will be removed in a future version. Use on_bad_lines in the future.
  chunk_iterator = pd.read_csv(
/gpfs/home/projects/analisi_vdj/alfredo.marchetti/sc_SC001/utils/fusionenv/lib/python3.9/site-packages/gtfparse/read_gtf.py:82: FutureWarning: The warn_bad_lines argument has been deprecated and will be removed in a future version. Use on_bad_lines in the future.
  chunk_iterator = pd.read_csv(`

But the main issue is when running FusionCandidate:

`Starting: 1
Candidate Size: 0
Found Size: 0
Starting: 2
Candidate Size: 16246
Found Size: 16246
Starting: 3
Candidate Size: 22217
Found Size: 22203
Starting: 4
Candidate Size: 25999
Found Size: 25969
Starting: 5
Candidate Size: 28898
Found Size: 28856
Traceback (most recent call last):
  File "/home/users/alfredo.marchetti.stud/analisi_vdj/alfredo.marchetti/sc_SC001/utils/scFusion-2.0.2//bin//PreProcessing_SingleFile.py", line 50, in
    Data1[index,:,0] = np.array([int(c) for c in ChimericRead[index].upper().replace('A','0').replace('T','1').replace('C','2').replace('G','3').replace('H','4')])
ValueError: could not broadcast input array from shape (27,) into shape (61,)`

Could you provide some guidance? I apologize if the report is not complete; please let me know if you need additional info. Greetings
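For readers hitting the same traceback: the ValueError is NumPy refusing to store a 27-element vector in a 61-element slot. Below is a minimal sketch that reproduces it. It assumes the array is preallocated with a fixed read length of 61 (inferred only from the error message; the real allocation in PreProcessing_SingleFile.py is not shown in this thread), and `encode_read` is a hypothetical helper that mirrors the replace chain in the traceback.

```python
import numpy as np

# Same mapping as the replace chain in the traceback: A,T,C,G,H -> 0..4.
BASE_CODES = {"A": 0, "T": 1, "C": 2, "G": 3, "H": 4}


def encode_read(read):
    """Hypothetical helper: turn a read string into an integer vector."""
    return np.array([BASE_CODES[c] for c in read.upper()])


EXPECTED_LEN = 61  # assumption: taken from "shape (61,)" in the error
data = np.zeros((2, EXPECTED_LEN, 1))

# A read of the expected length fills its row without complaint.
data[0, :, 0] = encode_read("A" * EXPECTED_LEN)

# A shorter read cannot be broadcast into the fixed-length row.
try:
    data[1, :, 0] = encode_read("A" * 27)
except ValueError as err:
    message = str(err)
    print(message)
```

If this is what happens in the pipeline, the question becomes why the chimeric reads are shorter than the length the preprocessing step expects.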

ZijieJin commented 2 years ago

It looks like the error occurs during the deep learning step. Would you mind sharing your input with me by email?

FredoJones commented 2 years ago

The reference files are:

- hg19.fasta
- GTF file from https://www.gencodegenes.org/human/release_19.html

The output folder for the scripts contains the following directories: ChiDist ChimericOut Expr fastq scFusionIndex scripts sniffer STARIndex STARMapping utils

The script I launch is:

`module load R/3.6.0
module load genetics/broadinstitute
source ~/.bashrc
conda activate /home/users/alfredo.marchetti.stud/analisi_vdj/alfredo.marchetti/sc_SC001/utils/fusionenv

python /home/users/alfredo.marchetti.stud/analisi_vdj/alfredo.marchetti/sc_SC001/utils/scFusion-2.0.2/scFusion.py FusionCandidate \
  -d /home/users/alfredo.marchetti.stud/analisi_vdj/alfredo.marchetti/sc_SC001/scFusionIndex \
  -b 1 \
  -e 5 \
  -o /home/users/alfredo.marchetti.stud/analisi_vdj/alfredo.marchetti/sc_SC001`

I will share the FASTQ files with you as soon as I get clearance. Do you need anything else?

ZijieJin commented 2 years ago

Looks fine. Please also check that none of the intermediate files are empty.
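One way to run that check is a short script that walks the output directory and lists zero-byte files. This is only a sketch: `find_empty_files` is a hypothetical helper, not part of scFusion, and the directory you pass is whatever was given to `-o`.

```python
import os


def find_empty_files(root):
    """Return the paths of all zero-byte files under root, sorted."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

Calling `find_empty_files("/path/to/output")` returns an empty list when every file has at least one byte. A non-zero size does not guarantee usable content, so binary outputs may still need to be opened and inspected.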

FredoJones commented 2 years ago

It seems that none of the generated folders are empty. The FASTQ files look like this:

`@A00721:422:HHH2WDSX3:3:1101:2826:1000 1:N:0:CCAAGATG
NTCGTAACATTCTCATACTTCTTCAG
+
#FFFFFFFFFFFFFFFFFFFF:FFFF
@A00721:422:HHH2WDSX3:3:1101:7148:1000 1:N:0:CCAAGATG
NTGGCAATCTGTGCAAACCTGGGGAA
+
#FFFFFFFFFFFFFFFFFFFFFFFFF
@A00721:422:HHH2WDSX3:3:1101:7744:1000 1:N:0:CCAAGATG
NCACGGATCATCTGCCAATATGTCCT
+
#FFFFFFFFFFFFFFFFFFF:FFFFF
@A00721:422:HHH2WDSX3:3:1101:8106:1000 1:N:0:CCAAGATG
NTGCTTCGTCTAGCGCGGCAGGTGTA
+
#FFFFFFFFFFFFFFFFFFF:F:F,F
@A00721:422:HHH2WDSX3:3:1101:9607:1000 1:N:0:CCAAGATG
NACTTGTCACGAAACGACCATAAATC`

The output of the ChiDist folder looks different from previous attempts:

21M May 26 23:42 ChiDist_middle.txt
2.3M May 26 23:42 FusionRead.txt
20M May 26 23:36 Homo.txt
128 May 26 19:36 Reads.npy
128 May 26 19:36 Reads_rev.npy

ZijieJin commented 2 years ago

Yes. It seems that the deep learning data was not generated as expected. Could you send me the files in the ChiDist folder?
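A note on the listing: `Reads.npy` and `Reads_rev.npy` are 128 bytes each. A `.npy` file starts with a header describing dtype and shape, and in common NumPy versions that header alone takes about that much space, so files this small probably hold arrays with zero elements. A minimal sketch to confirm this follows; `npy_summary` is a hypothetical helper, and the shape `(0, 61, 1)` in the demonstration is only an illustrative assumption.

```python
import os
import tempfile

import numpy as np


def npy_summary(path):
    """Return (file size in bytes, array shape, element count)."""
    arr = np.load(path)
    return os.path.getsize(path), arr.shape, arr.size


# Demonstration with a throwaway file: an array with zero elements
# is written as a header and nothing else.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "empty.npy")
    np.save(demo_path, np.zeros((0, 61, 1)))
    demo_size, demo_shape, demo_count = npy_summary(demo_path)
    print(demo_size, demo_shape, demo_count)
```

Running `npy_summary` on the real `Reads.npy` would show whether its element count is zero, which would mean no reads reached the deep learning step.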

FredoJones commented 2 years ago

I sent them by email to jinzijie@pku.edu.cn

FredoJones commented 2 years ago

My FASTQ files are from 10x, while this tool seems to work only with Smart-seq data. Is there any chance it could work on my 10x data?

ZijieJin commented 2 years ago

Our tool was optimized for Smart-Seq data rather than 10X. While you can run scFusion on a 10X dataset, the performance may be poor.
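A quick way to see the difference in the input is to look at read lengths. In 10x 3' libraries, read 1 typically carries the cell barcode and UMI (roughly 26 to 28 nt) rather than transcript sequence, and the FASTQ excerpt posted earlier in this thread appears consistent with that, although the thread does not confirm the library layout. The sketch below counts sequence lengths in a FASTQ stream; `read_length_counts` is a hypothetical helper and assumes plain 4-line records.

```python
import io
from collections import Counter


def read_length_counts(handle, max_records=10000):
    """Count sequence lengths in the first max_records FASTQ records."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i // 4 >= max_records:
            break
        if i % 4 == 1:  # second line of each 4-line record is the sequence
            counts[len(line.strip())] += 1
    return counts


# Two records copied from the excerpt posted earlier in this thread.
sample = io.StringIO(
    "@A00721:422:HHH2WDSX3:3:1101:2826:1000 1:N:0:CCAAGATG\n"
    "NTCGTAACATTCTCATACTTCTTCAG\n"
    "+\n"
    "#FFFFFFFFFFFFFFFFFFFF:FFFF\n"
    "@A00721:422:HHH2WDSX3:3:1101:7148:1000 1:N:0:CCAAGATG\n"
    "NTGGCAATCTGTGCAAACCTGGGGAA\n"
    "+\n"
    "#FFFFFFFFFFFFFFFFFFFFFFFFF\n"
)
lengths = read_length_counts(sample)
print(lengths)
```

For a real file, pass `open("1_1.fastq")` instead of the in-memory sample. Reads this short leave little sequence on either side of a fusion breakpoint, which may be one reason paired-end full-length protocols such as Smart-seq suit this kind of analysis better.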