ZijieJin / scFusion

Generated empty files and caused an error when running with Testdata #5

Closed MasayukiNagai closed 3 years ago

MasayukiNagai commented 3 years ago

Hi,

When I ran the program using Testdata, it caused an error when executing ./bin/CombinePipeline_Retrain.sh as follows.

$ python scFusion.py \
     -f Testdata/Testdata/ \
     -o TestOut/ \
     -b 1 -e 10 -t 20 \
     -s <path>/STARIndex/  \
     -g <path>/hg19.fa \
     -a <path>/hg19.ncbiRefSeq.gtf

Traceback (most recent call last):
  File "scFusion.py", line 273, in <module>
    aaa = subprocess.check_output(
  File "/<path>/.pyenv/versions/3.8.11/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/<path>/.pyenv/versions/3.8.11/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'sh ./bin/CombinePipeline_Retrain.sh TestOut/ . ./bin/../data/weight-V9-2.hdf5 10 ./bin/' returned non-zero exit status 1. 

Then, I ran the shell script alone to see what the problem was.

$ sh ./bin/CombinePipeline_Retrain.sh TestOut/ . ./bin/../data/weight-V9-2.hdf5 10 ./bin

Traceback (most recent call last):
  File "./bin/Data_preprocess_MyRetrain.py", line 29, in <module>
    read_new = read[0:MergePoint]+'H'+read[MergePoint:]
NameError: name 'MergePoint' is not defined

So, I checked the ./bin/Data_preprocess_MyRetrain.py and its input files. It seems that it gives an error because my TestOut/Retrain/ChimericRead.txt is empty and thus MergePoint = int(readinfo_split[1]) fails.
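The NameError is consistent with a variable that is assigned only inside a loop over the input lines: if the file is empty, the loop body never runs and the name is never bound. The sketch below is a hypothetical reconstruction, not the actual Data_preprocess_MyRetrain.py; the function name `mark_merge_points` and the tab-separated `read<TAB>merge_point` layout are assumptions made for illustration. It shows one way to fail with a clear message instead.

```python
def mark_merge_points(lines):
    """Insert 'H' at the merge point of each read.

    Assumed (hypothetical) layout: 'read<TAB>merge_point' per line.
    The real ChimericRead.txt layout may differ.
    """
    marked = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        read, merge_point = fields[0], int(fields[1])
        marked.append(read[:merge_point] + "H" + read[merge_point:])
    if not marked:
        # An empty input would otherwise surface later as a confusing NameError.
        raise ValueError("no chimeric reads found; is the input file empty?")
    return marked
```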

It seems that even before the python script is executed, some of the files generated have no data in them. (The first empty file generated should be {outdir}/ChiDist/Homo.txt if I understand the flow correctly) The following shows the sizes of files in the ChimericOut, ChiDist, and Retrain folders.

# TestOut/ChimericOut/
bit   2026 Sep 17 22:18 1_FusionSupport.txt
bit 182158 Sep 17 22:18 1_geneanno.sam
bit 177442 Sep 17 22:18 1.sam
...

# TestOut/ChiDist/
bit   0 Sep 17 22:18 ChiDist_middle.txt
bit   0 Sep 17 22:18 FusionRead.txt
bit   0 Sep 17 22:18 Homo.txt
bit 128 Sep 17 22:19 Reads.npy
bit 128 Sep 17 22:19 Reads_rev.npy

# TestOut/Retrain/
bit 0 Sep 17 22:19 ChimericRead.txt
bit 0 Sep 17 22:19 SimuRead.txt

What would be the output you expect to get when running the program with Testdata? I would really appreciate it if you could peek into the problem! Also, if you could upload the results you get when you run the program with Testdata, it would be really helpful to compare with what I get. Thank you!

ZijieJin commented 3 years ago

Got your problem. I will check it in three days.

ZijieJin commented 3 years ago

Hi, MasayukiNagai. This is a little strange: I could run the script and get the expected result. Could you run the third command in the CombinePipeline_startwith_FS.sh file?

python ${codedir}/FindHomoPattern_RAM.py ${FilePath}/ChimericOut/${prefix}FusionScore.txt ${hg19file} ${gtf} > ${FilePath}/ChiDist/${prefix}Homo.txt

Let's see what will happen. If it prints 'Bad Line', please show me the gtf file you use. I guess the script could not understand the gtf file so it printed nothing.

MasayukiNagai commented 3 years ago

As you suspected, running the command printed several 'Bad Line's.
I attached the gtf file below that you can also get from this page on the UCSC website. I also tried with hg19.refGene.gtf.gz, only to encounter the same error.

Would you mind telling me where you got your gtf file? (If from Ensembl, which release?)

hg19.ncbiRefSeq.gtf.gz

ZijieJin commented 3 years ago

I have updated the FindHomoPattern_RAM.py file. Please download it and replace the old file. scFusion can run properly with the gtf file you gave me and the new FindHomoPattern_RAM.py file.

MasayukiNagai commented 3 years ago

Thanks to the change you made, the FindHomoPattern_RAM.py file seems to generate the Homo.txt file without any problems.

However, the next step caused the following error and generated only an empty ChiDist_middle.txt:

$ python ./bin/FindChiDist.py TestOut/ChimericOut/ 1 10 TestOut/Expr/ TestOut/ChiDist/Homo.txt . > TestOut/ChiDist/ChiDist_middle.txt

Traceback (most recent call last):
  File "./bin//FindChiDist.py", line 419, in <module>
    thischr2 = chr2num(CandidateList[l][3])
  File "./bin//FindChiDist.py", line 24, in chr2num
    return int(str)
ValueError: invalid literal for int() with base 10: 'Un_gl000220'

It seems that chr2num gives an error because of chrUn_gl000220 in {i}_FusionSupport.txt file. Here is the 1_FusionSupport.txt that has chrUn_gl000220 just in case.
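The traceback shows `chr2num` calling `int()` on whatever follows the `chr` prefix, which cannot work for unplaced contigs such as `chrUn_gl000220`. A minimal sketch of a more tolerant version follows. It is not the author's fix: the mapping of X and Y to 23 and 24 and the choice to return `None` for other contigs are assumptions, and callers would need to skip candidates for which it returns `None`.

```python
def chr2num(name):
    """Map a chromosome name to an integer, or None for non-canonical contigs.

    Assumption: X and Y map to 23 and 24; the real FindChiDist.py may
    use a different numbering.
    """
    label = name[3:] if name.startswith("chr") else name
    if label == "X":
        return 23
    if label == "Y":
        return 24
    if label.isdigit() and 1 <= int(label) <= 22:
        return int(label)
    return None  # e.g. 'Un_gl000220', 'M', or alternate haplotypes
```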

ZijieJin commented 3 years ago

Yes, this is a known issue. I will fix it in the next version (v1.4). To avoid this bug for now, please delete all lines in the gtf file where the chromosome is not chr1-chr22, chrX, or chrY.
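The workaround above can be scripted rather than done by hand. The following is a minimal sketch, assuming a standard tab-separated GTF whose first column is the sequence name; the function name `filter_gtf` and the file names in the usage note are illustrative only.

```python
CANONICAL = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}

def filter_gtf(src_path, dst_path, keep=CANONICAL):
    """Copy src_path to dst_path, keeping comment lines and records on
    chr1-chr22, chrX, and chrY. Returns the number of lines written."""
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # Column 1 of a GTF record is the sequence (chromosome) name.
            if line.startswith("#") or line.split("\t", 1)[0] in keep:
                dst.write(line)
                written += 1
    return written
```

For example, `filter_gtf("hg19.ncbiRefSeq.gtf", "hg19.ncbiRefSeq.canonical.gtf")` would write a filtered copy next to the original, which can then be passed to scFusion with -a.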

MasayukiNagai commented 3 years ago

It seems to work now. Just to be sure, I've got empty {i}.rpkm.txt files in the Expr folder. Is that okay?

Also, would it be possible for you to upload the expected results, if not all, that you get by running the Testdata? I personally feel it a little hard to identify an issue when there is one because processes are executed via subprocess and even when a subprocess gives an error, the main process keeps running.
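One general way to make a failing step visible is to run each step with `subprocess.run(..., check=True)` and print the captured stderr before re-raising. This is a sketch of that pattern only, not how scFusion.py is written; `run_step` is a hypothetical helper name.

```python
import subprocess
import sys

def run_step(cmd):
    """Run one pipeline step; on failure, print its stderr and re-raise."""
    try:
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as err:
        print(f"step failed (exit {err.returncode}): {cmd}", file=sys.stderr)
        print(err.stderr, file=sys.stderr)
        raise  # stop the pipeline instead of continuing with empty outputs
    return result.stdout
```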

ZijieJin commented 3 years ago

The empty expression files are not expected. Could you please delete the bed file and the GenePos.txt file in the data/ folder and rerun scFusion? An example command is below:

python software/scFusion.py -f testdata/ -o testout/ -b 1 -e 10 -s hg19STARIndex/ -t 8 -n 0.9 -g hg19.fa -a ref_annot.gtf

I guess it will work with the proper gtf file.

scFusion is expected to report IGHJ5-IGHA1 fusion in the testout/FinalResult/FinalOutput.abridged.txt file, as I mentioned in README.

As you said, it is really hard to identify the issue when running scFusion using subprocesses. A quick way to check whether it runs properly is to check all the intermediate files and make sure they are not empty.
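The check suggested here is easy to automate. Below is a minimal sketch that lists every zero-byte file under an output directory; the helper name `find_empty_files` is illustrative.

```python
from pathlib import Path

def find_empty_files(outdir):
    """Return sorted paths of zero-byte regular files under outdir."""
    return sorted(
        p for p in Path(outdir).rglob("*")
        if p.is_file() and p.stat().st_size == 0
    )
```

Calling `find_empty_files("testout/")` after a run would, in the failing case reported at the top of this thread, be expected to list files such as ChiDist/Homo.txt and Retrain/ChimericRead.txt.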

MasayukiNagai commented 3 years ago

I see your point about the verification on Testdata.

I ran the command again after deleting the two files but I still get empty rpkm files. Would you mind attaching the gtf file you are using or sending me the link of where to get it (I looked up STAR-Fusion repo but could not find the gtf file you mentioned)?

Also, the bed file and GenePos.txt in the data folder are empty after the execution.

ZijieJin commented 3 years ago

I am really sorry for the issue. The attached files are the split gtf I used (the full file is too big to upload); first unzip them and then concatenate them. The gtf file I use can also be found here. Download the zip file (~30G).

Could you open the CombinePipeline_before_FS.sh in the bin/ folder and run the last command? Let's see what will happen.

ref_annot_1.gtf.zip ref_annot_2.gtf.zip
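The unzip-then-concatenate step for the two attachments can be done in one pass. This sketch assumes each archive holds plain-text GTF parts and that concatenating them in the order given reproduces the full file; the member names inside the archives are not known here, so every member is appended in archive order.

```python
import shutil
import zipfile

def concat_zipped_parts(zip_paths, out_path):
    """Append every file member of each archive, in the order given, to out_path."""
    with open(out_path, "wb") as out:
        for zip_path in zip_paths:
            with zipfile.ZipFile(zip_path) as archive:
                for member in archive.namelist():
                    # Skip directory entries and macOS metadata, if present.
                    if member.endswith("/") or member.startswith("__MACOSX/"):
                        continue
                    with archive.open(member) as part:
                        shutil.copyfileobj(part, out)
```

For example, `concat_zipped_parts(["ref_annot_1.gtf.zip", "ref_annot_2.gtf.zip"], "ref_annot.gtf")` would produce a single ref_annot.gtf to pass with -a.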

MasayukiNagai commented 3 years ago

Great! I can now get non-empty rpkm files and so on! (Just for the record, I used GRCh37_gencode_v19_CTAT_lib_Mar012021.source/gencode.v19.annotation.gtf.)

However, "FinalOutput.abridged.txt" only includes its header, which means that the file is basically empty. I don't see any empty files in any folder but FinalResult right now. I'll look at the code again but do you have any idea what causes this?

ZijieJin commented 3 years ago

Great! The test data here is to help you check whether you can run scFusion properly, so we don't need to be aware of the biological meaning of reported fusions.

Did you add the parameter "-n 0.9" when running scFusion? With the default parameters, scFusion will report no fusion genes in this dataset. If no fusion genes are reported after specifying the -n parameter, please check the Allresult.txt file in FinalResult/temp/ and see whether the IGHJ5-IGHA1 fusion is included in this file.
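A quick way to look for the expected fusion is to search the result file for lines that mention both gene names. The column layout of Allresult.txt is not described in this thread, so the sketch below makes no assumption about it and uses a plain substring match, which could also match longer gene names that contain these strings.

```python
def lines_mentioning(path, *genes):
    """Return the lines of a text file that contain every given gene name."""
    with open(path) as handle:
        return [
            line.rstrip("\n")
            for line in handle
            if all(gene in line for gene in genes)
        ]
```

For example, `lines_mentioning("testout/FinalResult/temp/Allresult.txt", "IGHJ5", "IGHA1")` returns an empty list if the fusion was not reported at all.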

MasayukiNagai commented 3 years ago

After adding "-n 0.9" parameter, I got the expected result and everything looks good!!

Thank you so much for your help!

ZijieJin commented 3 years ago

Amazing! Now, you can run scFusion with your own dataset to detect gene fusion!

And I will fix the bugs mentioned above in the next version, stay tuned!

biginfor commented 2 years ago

The good news is that everything runs without any errors. The bad news is that I didn't get any positive results. Both [/scFusion-1.4/Testdata/Testdata/FinalResult/FinalOutput.abridged.txt] and [FinalResult/temp/Allresult_filtered.txt] are generated, but they contain no information. P.S. I did add the parameter -n 0.9. Maybe something went wrong? While troubleshooting, I changed the way some packages were imported.

biginfor commented 2 years ago

Although everything seemed OK, the error message was not printed to the screen (it went to [scfusion/scFusion-1.4/Testdata/Testdata/log.txt] instead). As a result, I didn't realize that the program skipped the "predicting", "Step using Neural Network!", and "Start Statistical Model" steps, and so on. All in all, there was a problem with the import of Python modules. Now everything is OK. Thanks a lot!
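Since errors can end up in log.txt rather than on the screen, scanning the log after a run is a cheap sanity check. The sketch below flags lines that look like Python errors; the pattern is a heuristic chosen for illustration, not something defined by scFusion, and it may produce false positives on lines that merely contain the word "Error".

```python
import re

# Heuristic: Python tracebacks, exception names ending in 'Error',
# and the usual failed-import message.
ERROR_PATTERN = re.compile(r"Traceback|Error|No module named")

def find_log_errors(log_path):
    """Return (line_number, text) pairs for log lines that look like errors."""
    hits = []
    with open(log_path, errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if ERROR_PATTERN.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits
```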

ZijieJin commented 2 years ago

You're welcome. I will improve the user experience in a later version.

MasayukiNagai commented 2 years ago

I've successfully run the program over the test data with scFusion v2.0.1, but it would be nice if you could specify -n 0.9 and the expected output in the manual.pdf as you did on the README.md before because just running the commands in the manual generates an output with no fusion in it.

MasayukiNagai commented 2 years ago

This is not directly relevant to this issue, but after running FusionReport command, I got Final Results are in ${outdir}/FinalResult/FinalOutput.abridged.txt, which is on scFusion.py:254. However, I could not find the file probably because the file is copied to Result.abridged.txt and the folder is renamed to Resulttemp right after that. Thus, it would be great if you could adjust the print statement on scFusion.py:254. Thank you!

ZijieJin commented 2 years ago

Good suggestions! Please see the latest version!