Issue #7: Knowledge graph and run_Speed_up.py Error. drob2727 closed this issue 2 years ago.
It seems the format of your files in the 'bacteria' folder is incorrect, so BLASTN cannot run with the command "blastn -query out/query.fa -db blast_db/bin -outfmt 6 -out blast_tab/bin.tab -num_threads 8".
You can check whether you can build the database with "makeblastdb -in bacteria/bin.fasta -dbtype nucl -parse_seqids -out blast_db/bin" and then run BLASTN against it with your virus file.
I see, so it's because my files end in .fa instead of .fasta. If I fix the file extension, can I resume from where I left off, or do I have to start over?
No, it is not because of your file extension; the program can read either automatically. I mean that you need to run the makeblastdb and blastn commands yourself to debug, because it seems BLAST cannot run correctly.
'bin.fasta' is just an example of an input prokaryotic genome, since I do not know your file names.
Building a new DB, current time: 01/16/2022 09:51:20
New DB name: /lustre/scratch/usr/fslcollab273/HostG/blast_db/bin
New DB title: bacteria/bin.1408.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 11 sequences in 0.048352 seconds
It looks like it can make the database, so should I do this for all of my bins?
How do I run the blastn command? Thank you so much for the help.
There are two dots (.) in your file name. Please use the same naming as the example files: accession.fa or accession.fasta. You can use an underscore (_) to replace the first dot.
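For example, a minimal Python sketch along these lines could do that renaming in bulk; the 'bacteria' folder name comes from this thread, but renaming in place is an assumption, so back up the folder first:

import os

# Replace extra dots in bacteria/ file names with underscores, keeping only the
# final extension, e.g. bin.1408.fasta -> bin_1408.fasta (sketch, not part of HostG).
for fn in os.listdir("bacteria"):
    if "." not in fn:
        continue
    stem, ext = fn.rsplit(".", 1)
    fixed = stem.replace(".", "_") + "." + ext
    if fixed != fn:
        os.rename(os.path.join("bacteria", fn), os.path.join("bacteria", fixed))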
Also make sure that your virus FASTA has at least one sequence longer than 8000 bp if you use Len=8000.
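If you want to verify that quickly, here is a small sketch (plain Python, no extra packages; the file name is only a placeholder for your virus FASTA):

# Print the length of the longest sequence in a FASTA file (sketch).
def longest_sequence_length(path):
    longest = current = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                longest = max(longest, current)
                current = 0
            else:
                current += len(line.strip())
    return max(longest, current)

print(longest_sequence_length("my_virus_contigs.fa"))  # should be >= 8000 when Len=8000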
After solving the problems above, run the two commands below to check whether BLAST runs correctly.
makeblastdb -in bacteria/bin_1408.fasta -dbtype nucl -parse_seqids -out blast_db/bin_1408
blastn -query query.fa -db blast_db/bin_1408 -outfmt 6 -out blast_tab/bin_1408.tab -num_threads 8
query.fa is your input file.
If they are OK, then you can rerun HostG.
It worked!
(HostG) -bash-4.2$ cat bin_1408.tab
VF_VS_2402 k141_959580_length_146859_cov_1634.8330 93.846 130 7 1 8881 9009 1952 2081 9.83e-50 195
VF_VS_7541 k141_959580_length_146859_cov_1634.8330 100.000 52 0 0 200 251 71 20 1.72e-20 97.1
VF_VS_7852 k141_995934_length_49184_cov_1635.8327 81.558 385 63 7 329 711 48805 49183 5.55e-85 311
VF_VS_13436 k141_959580_length_146859_cov_1634.8330 100.000 141 0 0 1 141 141 1 4.60e-70 261
VF_VS_13436 k141_959580_length_146859_cov_1634.8330 96.296 135 5 0 2290 2424 154 20 2.17e-58 222
VF_VS_13847 k141_8453456_length_35143_cov_1469.2361 77.815 302 65 2 26 326 22545 22845 2.81e-47 185
VF_VS_14898 k141_959580_length_146859_cov_1634.8330 90.698 43 4 0 3552 3594 9502 9544 6.22e-09 58.4
VF_VS_20168 k141_959580_length_146859_cov_1634.8330 98.113 53 1 0 3451 3503 19 71 1.51e-19 93.5
VF_VS_34343 k141_995934_length_49184_cov_1635.8327 80.097 206 38 3 1 205 49097 48894 7.19e-37 150
That's what I got. Does that look right? Should I start the process all over by deleting the newly generated files?
Then, I think the program should be ok this time.
You do not need to remove them yourself; just rerunning it should be fine.
I'm having a hard time understanding the 'disk quota exceeded' OSError. I ran this with 1500 GB of RAM and I have terabytes of space left on my drive. Any ideas?
Traceback (most recent call last):
File "run_phage_phage.py", line 238, in
Ummmmmm.
If you are using an HPC, you can check it with the command "quota", since there are two kinds of 'disk quota exceeded': file count and size. You might be able to find something out.
However, if you are using your own computer, I am afraid that I do not know much about this.
A simple way to test is to use a smaller file. For example, you can split your input into a smaller subset, run it as an example, and check whether any bugs appear.
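For instance, a small sketch like the following (plain Python; the file names are placeholders) writes only the first few records of a FASTA into a test file:

# Copy the first n records of a FASTA into a smaller test file (sketch).
def take_first_records(src, dst, n=5):
    kept = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                kept += 1
                if kept > n:
                    break
            fout.write(line)

take_first_records("my_virus_contigs.fa", "my_virus_contigs_small.fa", n=5)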
Creating blast database...
Running blastn...
Creating blast database...
Running blastn...
run_KnowledgeGraph.py:100: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
node_feature = np.array(node_feature)
folder GCN_data exist... cleaning dictionary
Traceback (most recent call last):
File "run_KnowledgeGraph.py", line 155, in
I'm having a slightly different knowledge graph error. Any ideas?
It seems you forgot to add the taxonomy labels of your prokaryotes to dataset/label.csv.
BTW, according to your previous description, you may not know the taxonomy labels of your bins. In that situation, you can directly use names like "bin_1408" as their labels. However, we cannot guarantee the model will work well: HostG is a multi-class classifier, and if the number of samples per class is very small, it might not be able to learn from them. But you can try.
Another solution is to use our newly developed method, Cherry (https://github.com/KennthShang/CHERRY). We tried to fix the aforementioned problem in that work. The usage is nearly the same, but with a new formulation. You can follow its guidelines and run it at the same time.
Best, Jiayu
I'll look through it, but I do have taxonomy in the label.csv file. Maybe it's better to just have the contigs match up with the bin names and then use my own taxonomy file, since it doesn't seem to be working. I'll go through it again and see if there are any discrepancies between my taxonomy file and what it is supposed to be. I'll send a screenshot.
Thank you so much
I see.
The program treats the file names in the 'bacteria/' folder (minus the '.fasta' extension) as the accession names.
For example, in the program
tmpL = label_df[label_df['accession'] == node.split('.')[0]][taxa].values[0]
node.split('.')[0] is bin_1408, so it will search label.csv and return its labels.
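To see which bins would fail this lookup, a quick sketch like the one below (assuming pandas is installed and the 'bacteria/' and 'dataset/label.csv' paths from this thread) mimics that line and lists any accession missing from label.csv:

import os
import pandas as pd

label_df = pd.read_csv("dataset/label.csv")
known = set(label_df["accession"])

for fn in os.listdir("bacteria"):
    acc = fn.split(".")[0]  # same as node.split('.')[0]: everything before the FIRST dot
    if acc not in known:
        print(f"{acc} has no row in label.csv")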
Sorry for the inconvenience. Maybe you can also try a small sample for testing first; it will save time when constructing the knowledge graph (which is an O(N^2) algorithm).
Best, Jiayu
Makes sense to me. If I'm following correctly, that's how it's set up right now: it says bin_1408 and then the taxonomy. Is that wrong?
Yes. What you need to add to label.csv looks like this (just an example):
accession,phylum,class,order,family,genus
bin_1408,Proteobacteria,Betaproteobacteria,Burkholderiales,Alcaligenaceae,Achromobacter
"accession,phylum,class,order,family,genus" is the header line; there is no need to add it again.
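If some bins have no known taxonomy, one option, following the earlier suggestion of using the bin name itself as its label, is a sketch like this (assuming pandas and the columns shown in the example above; adjust if your label.csv has different ranks):

import os
import pandas as pd

label_path = "dataset/label.csv"
label_df = pd.read_csv(label_path)
known = set(label_df["accession"])

# Append one placeholder row per bin that is not already in label.csv,
# using the bin name for every rank (sketch, not part of HostG).
rows = []
for fn in os.listdir("bacteria"):
    acc = fn.split(".")[0]
    if acc not in known:
        rows.append({"accession": acc, "phylum": acc, "class": acc,
                     "order": acc, "family": acc, "genus": acc})

if rows:
    pd.concat([label_df, pd.DataFrame(rows)], ignore_index=True).to_csv(label_path, index=False)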
Best, Jiayu
Ok, gotcha. I will recheck my file when I get done teaching today and verify it. I really appreciate all the help. We're predicting hosts for our viruses from Antarctica; we've predicted hosts through NCBI, but there's nothing better than using our own bacterial contigs. If we can get this working, I look forward to using this great tool in multiple papers.
Thanks a lot.
BTW, maybe you will also want to use our new method, Cherry, in the future (although it is still under review). In our latest experiments, it can return labels down to the species level with higher accuracy.
Best, Jiayu
Oh wow, I’ll take a look at it this afternoon
Everything looked great after I fixed the names, until this:
run_KnowledgeGraph.py:100: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
node_feature = np.array(node_feature)
folder GCN_data exist... cleaning dictionary
Traceback (most recent call last):
File "run_KnowledgeGraph.py", line 155, in
It seems the same problem still exists.
Could you send me some samples for testing? For example, could you share some of your bins (4 or 5 of them), if they are not private or too large? Then I can test them on my HPC.
If you want, you can send them (or the link) to my email: jyshang2-c@my.cityu.edu.hk
Best, Jiayu
Hi, I have checked your data, and it seems everything works well. The output for your examples looks like this: https://user-images.githubusercontent.com/22445402/152128790-05ae337f-7cd2-4fa7-8a8f-28a3cc965582.png
Please check whether you have added the taxa like below (in label.csv): https://user-images.githubusercontent.com/22445402/152128375-92cc83c6-60a9-413f-a495-09ff3349c0b8.png
Since I do not know the exact taxa, I used 'temp' instead. You can also run the example files you sent me to test whether the program works correctly on your PC/HPC.
Best, Jiayu
Thank you so much for looking at that. I think it has to be my label.csv file that's wrong. I'll send a picture.
This is what my label.csv file looks like. Is it the 'NA' and 'no support' values that are giving me trouble? I tried to fix a couple of things and resubmit, and it still had problems.
Traceback (most recent call last):
  File "run_phagehost.py", line 64, in
    = subprocess.check_call(blast_cmd, shell=True)
  File "/fslhome/fslcollab273/.conda/envs/Host/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'blastn -query out/query.fa -db blast_db/bin -outfmt 6 -out blast_tab/bin.tab -num_threads 8' returned non-zero exit status 2.
folder Cyber_data/ exist... cleaning dictionary
Cannot clean your folder... permission denied
cat: pred/: No such file or directory
folder input exist... cleaning dictionary
Dictionary cleaned
folder pred exist... cleaning dictionary
folder Split_files exist... cleaning dictionary
Dictionary cleaned
folder tmp_pred exist... cleaning dictionary
Knowledge Graph Error for file contig_0
Knowledge Graph Error for file contig_1
Knowledge Graph Error for file contig_2
Knowledge Graph Error for file contig_3
phage_host Error for file contig_4
Pre-trained CNN Error for file contig_5
Traceback (most recent call last):
File "run_Speed_up.py", line 157, in
out = subprocess.check_call(cmd, shell=True)
File "/fslhome/fslcollab273/.conda/envs/Host/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cat pred/ > final_prediction.csv' returned non-zero exit status 1.
We were so close, I think. Any ideas?