DorresteinLaboratory / NAP_ProteoSAFe


upload large in house datasets (combinatorial and UNPD database) #1

Closed dlforrister closed 4 years ago

dlforrister commented 4 years ago

Good afternoon. Not exactly an issue but a question.

Our group has generated several large theoretical libraries, based on combinatorial chemistry, covering the common structures/substitutions we anticipate in our dataset. We'd like to use these in the NAP workflow, but when uploading .txt files containing SMILES and IDs we get a proxy error. When I subset the dataset to something smaller, I did not receive the same error.

I am wondering if it's possible to upload these large libraries (90,000 SMILES structures, ~22 MB text file)? Similarly, we'd like to upload the Universal Natural Products Database (170,000 structures, ~13 MB total). UNPD, as a large natural product library, would likely be useful to other groups as well.

I submitted these large files yesterday (Nov 19, 2019) via the web portal as a test. I do not know if they are running or bogging down your servers. I can patiently wait for them to run, but if they are bogging down your computers or taking up too much computing power on your end, is it possible to share some of the code that converts the .txt database into the fully classified database that NAP uses? We could likely generate our own formatted databases on our own computing resources.

I just wanted to be proactive and ask how you suggest handling these larger structure databases.

Thanks -DF

rsilvabioinfo commented 4 years ago

Hello DF,

thanks for reaching out.

I imagine that when you say you submitted a database, you mean you tried the formatting tool: http://dorresteinappshub.ucsd.edu:5002/upload

The code for both the in-house formatting (https://github.com/DorresteinLaboratory/NAP_ProteoSAFe/blob/master/formatdb/formadb.ipynb) and the webserver (https://github.com/computational-chemical-biology/formatdb) is available. If you look at the code, you will notice that we only retrieve class assignments for structures that have already been classified, and it is still quite slow (the classifier server also limits the number of queries now).
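For orientation, the core of the formatting step looks roughly like the sketch below (a simplified illustration, not the notebook's exact code; the ClassyFire endpoint and the two-column input layout are my assumptions):

```python
# Rough sketch of the formatting step (illustrative, not the exact
# notebook code): read an id/SMILES text file, compute InChIKeys with
# RDKit, and query ClassyFire only for already-classified structures.
import time
import requests
import pandas as pd
from rdkit import Chem

db = pd.read_csv('inhouse.txt', sep='\t', names=['id', 'smiles'])

def to_inchikey(smiles):
    mol = Chem.MolFromSmiles(smiles)
    # Skip unparsable SMILES instead of crashing the whole run.
    return Chem.MolToInchiKey(mol) if mol else None

db['inchikey'] = db['smiles'].map(to_inchikey)

def classyfire_class(inchikey):
    # Assumed endpoint; returns a class name only if the structure has
    # already been classified, otherwise None.
    r = requests.get(f'http://classyfire.wishartlab.com/entities/{inchikey}.json')
    time.sleep(1)  # the classifier server rate-limits queries
    if r.ok and r.json().get('class'):
        return r.json()['class']['name']
    return None

db['class'] = [classyfire_class(k) if k else None for k in db['inchikey']]
db.to_csv('inhouse_formatted.tsv', sep='\t', index=False)
```

The classification query is the slow part, which is why only pre-classified structures are looked up.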

If you need help, you can directly email me the structures and I will do the queries for you.

Cheers, Ricardo


dlforrister commented 4 years ago

Thanks for the super fast reply!

I got the code working yesterday on my machine and it ran overnight. I was able to process the biggest one, ~200K structures! Classification is definitely the rate-limiting step, but all in all not too bad.

Do you think it's worth making this in-house database public? UNPD (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3636197/). Our lab started using it for in silico work because CFM-ID uses it as a database in this paper (https://pubs.acs.org/doi/10.1021/acs.analchem.5b04804), so it seemed to make sense to use it in both of our in silico prediction methods (CFM-ID and NAP/MetFrag).

Best,

Dale Forrister


rsilvabioinfo commented 4 years ago

Hello Dale,

If you share UNPD with me, I would be happy to make it available through NAP's interface.

I also know that your lab has biological-source information for where some compounds were isolated. Is that publicly available? Can you share that as well?

Cheers, Ricardo


rsilvabioinfo commented 4 years ago

Dear Dale,

NAP does not support multiple files; you should concatenate all structures into one file, see attached.

http://proteomics2.ucsd.edu/ProteoSAFe/status.jsp?task=d38a75d5e830476a9cdbbcb8f3008932

Unique identifiers: 117911
Unique first-block InChIKeys: 124060
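For the concatenation itself, a minimal sketch (the file names and the two-column id/SMILES layout are assumptions for illustration):

```python
# Merge several id/SMILES text files into one NAP input,
# dropping duplicate identifiers along the way.
import glob
import pandas as pd

parts = [pd.read_csv(f, sep='\t', names=['id', 'smiles'])
         for f in glob.glob('databases/*.txt')]
merged = pd.concat(parts, ignore_index=True).drop_duplicates(subset='id')
merged.to_csv('all_structures.txt', sep='\t', header=False, index=False)
```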

Also, you have selected the wrong adduct type for the selected acquisition mode. Take a look at the documentation:

https://ccms-ucsd.github.io/GNPSDocumentation/gnpsanalysisoverview/#advanced-analysis-tools

Cheers, Ricardo

undp.zip https://drive.google.com/file/d/0BzKRIbLR_npcNW81cG9rZk44NTdvMGVEUjFvTHZjaC1tZ0tr/view?usp=drivesdk

On Fri, Dec 6, 2019, at 20:23, dlforrister notifications@github.com wrote:

Hi,

I was finally able to run NAP but got the following error. Any ideas whether this has to do with the in-house databases or another issue?

Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
Calls: load -> readChar
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
  cannot open compressed file 'split_data/tabgnps.rda', probable reason 'No such file or directory'
Execution halted
Traceback (most recent call last):
  File "/data/beta-proteomics2/tools/nap_ccms2/merge_fragments.py", line 32, in <module>
    main()
  File "/data/beta-proteomics2/tools/nap_ccms2/merge_fragments.py", line 14, in main
    fls = os.listdir('fragmenter_res')
FileNotFoundError: [Errno 2] No such file or directory: 'fragmenter_res'
Tool execution terminates abnormally with exit code [1]

job id = c4ca886765bb4001b17b1e72a414792b


dlforrister commented 4 years ago

Thanks again for all the info.

I was able to concatenate all of our in-house databases into a single upload and ran it on GNPS last night.

One thing that I don't understand: all the iterations I've run over the past two days ended with 3685 matched results. This is surprising because they all had different numbers of candidate structures to compare against: 1) only the UNPD database (the job you just sent me above, d38a75d5e830476a9cdbbcb8f3008932: http://proteomics2.ucsd.edu/ProteoSAFe/status.jsp?task=d38a75d5e830476a9cdbbcb8f3008932); 2) all the databases run as separate files (ID=5a2d9c6f19204814b148ce48092da8f9); 3) all databases in a single file (ID=690954541fba4868b5542da0c52e6e6d). I'm wondering why the number of hits doesn't change. When I add more databases, NAP is definitely matching to the additional structures (i.e., structures that are only in run 3, with all the databases, show up in the list of matches), yet it always reports the same total number of hits.

Is there a limit on the number of hits that I'm not understanding? Shouldn't I expect more hits as more databases are added? Or does it always return the same maximum number of hits, with more databases only changing which candidate is selected as best?

-DF


rsilvabioinfo commented 4 years ago

Dear Dale,

Happy to hear it is working.

The 3685 you see here (https://proteomics2.ucsd.edu/ProteoSAFe/result.jsp?task=d38a75d5e830476a9cdbbcb8f3008932&view=summary_report) is not the number of hits; it is the number of nodes in your network that have at least one connection and can be used for propagation. It will be constant for the same GNPS job.

The simplest way to recover the number of hits is to download the table from the link above and open the 'node_attributes_table.tsv' file (as instructed in the documentation) with Excel or similar. Go to the 'MetFragID' column, sort the table by it, and count the number of non-empty rows in that column; that is the number of nodes that had at least one in silico hit, in this case 2342.
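If you prefer to do that count programmatically, a few lines of Python do the same thing (just a sketch, using only the file and column names from above):

```python
# Count nodes with at least one in silico hit, i.e. rows where the
# MetFragID column of the node attributes table is non-empty.
import pandas as pd

nodes = pd.read_csv('node_attributes_table.tsv', sep='\t')
print(nodes['MetFragID'].notna().sum())  # 2342 for this job
```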

Hope this is clear.

Cheers, Ricardo

Em qua., 11 de dez. de 2019 às 15:04, dlforrister notifications@github.com escreveu:

Thanks again for all the info.

I was able to concatenate all of our in_house databases into a single upload and run on GNPS last night.

One thing that I don't understand is that in all the iterations I've run over the past two days ended with 3685 matched results. This is surprising because they all have different numbers of candidate structures to compare to....1) Only the UNPD database (the job you just sent me above - d38a75d5e830476a9cdbbcb8f3008932 < http://proteomics2.ucsd.edu/ProteoSAFe/status.jsp?task=d38a75d5e830476a9cdbbcb8f3008932

) 2) All the databases run but in separate files( ID=5a2d9c6f19204814b148ce48092da8f9) 3) All databases in a single file ( ID=690954541fba4868b5542da0c52e6e6d ). I'm wondering why the number of hits doesn't change? When I've added more databases, NAP is definitely matching to the additional structures (i.e. structures that are only in the run 3 with all the databases show up in the list of matches) yet, it always gets the same total number of hits?.

Is there a limit on the number of hits I'm not understanding? Shouldn't I expect more hits with more databases added? Does it always provide that max number of hits, but with more databases it's changing which one selected as best?

-DF

On Sat, Dec 7, 2019 at 5:42 PM Ricardo notifications@github.com wrote:

Dear Dele,

NAP does not support multiple files, you should concatenate all structures in one file, see attached.

http://proteomics2.ucsd.edu/ProteoSAFe/status.jsp?task=d38a75d5e830476a9cdbbcb8f3008932

Unique identifiers: 117911 Unique first block inchikey: 124060

Also, you have selected the wrong addict type for the selected acquisition mode. Take a look at the documentation

https://ccms-ucsd.github.io/GNPSDocumentation/gnpsanalysisoverview/#advanced-analysis-tools

Cheers, Ricardo

undp.zip <

https://drive.google.com/file/d/0BzKRIbLR_npcNW81cG9rZk44NTdvMGVEUjFvTHZjaC1tZ0tr/view?usp=drivesdk

Em sex, 6 de dez de 2019 20:23, dlforrister notifications@github.com escreveu:

Hi,

I was finally able to run NAP but got the following error. Any ideas if this has to do with the in house databases or another issue?

Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection Calls: load -> readChar In addition: Warning message: In readChar(con, 5L, useBytes = TRUE) : cannot open compressed file 'split_data/tabgnps.rda', probable reason 'No such file or directory' Execution halted Traceback (most recent call last): File "/data/beta-proteomics2/tools/nap_ccms2/merge_fragments.py", line 32, in main() File "/data/beta-proteomics2/tools/nap_ccms2/merge_fragments.py", line 14, in main fls = os.listdir('fragmenter_res') FileNotFoundError: [Errno 2] No such file or directory: 'fragmenter_res' Tool execution terminates abnormally with exit code [1]

job id = c4ca886765bb4001b17b1e72a414792b

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <

https://github.com/DorresteinLaboratory/NAP_ProteoSAFe/issues/1?email_source=notifications&email_token=AAN2SA63DTDAJYKA3QGGY73QXLNF7A5CNFSM4JPXJDZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGFVIPQ#issuecomment-562779198

, or unsubscribe <

https://github.com/notifications/unsubscribe-auth/AAN2SA6UNJL3KORXZHD5ONLQXLNF7ANCNFSM4JPXJDZA

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub < https://github.com/DorresteinLaboratory/NAP_ProteoSAFe/issues/1?email_source=notifications&email_token=ADTWESUD7TNN546M7O2GNJDQXQ7IJA5CNFSM4JPXJDZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGGS4HQ#issuecomment-562900510 , or unsubscribe < https://github.com/notifications/unsubscribe-auth/ADTWESQORLTQM3Y2LSFWUK3QXQ7IJANCNFSM4JPXJDZA

.

-- PhD Candidate Coley/Kursar Lab Department of Biology 257 S 1400 E, University of Utah Salt Lake City, UT 84112

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DorresteinLaboratory/NAP_ProteoSAFe/issues/1?email_source=notifications&email_token=AAN2SA7VFHWATK2CR3BODODQYETRHA5CNFSM4JPXJDZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGUBNXI#issuecomment-564664029, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN2SAZA5QNFNK4H5OIL25LQYETRHANCNFSM4JPXJDZA .

dlforrister commented 4 years ago

That makes a lot more sense. Thanks for explaining.

One more question, about the different scoring metrics: I'm having a hard time wrapping my head around the scores. I understand the difference between the Fusion and Consensus methods from the NAP paper. However, what is the actual score value? Is it just a rank, or is it something more like a cosine score, where the value ranges from 0 to 1?

If this is in the documentation somewhere, sorry for missing it, but I've been looking and haven't found it.

Basically, I'm wondering how to weed through the ~2600 hits to figure out what is potentially a real hit and what is not. This is a little more intuitive when dealing with cosine scores.

Sorry that I'm asking so many questions! I seriously appreciate all the help. It's really great when someone creates a tool for the community and supports others using it!

Cheers,

Dale


rsilvabioinfo commented 4 years ago

Hi Dale,

to learn more about the score, take a look at MetFrag's paper: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-148

Basically, to generate the simulated fragments, the bond energies are taken into account, and to score the candidates, the number of predicted fragments and their intensities are taken into account.

The scores are scaled from 0 to 1, but unfortunately they only make sense for comparison among candidates. There is no threshold score: should you trust a score below 0.6, for example? It is not possible to say. In most cases, though, if one candidate has a score of 0.9 and another 0.09, it is fair to say the higher-scoring one is more likely.
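In practice that means ranking candidates within each node rather than filtering on an absolute cutoff. A sketch of what I mean (the column names here are illustrative, not necessarily NAP's exact headers):

```python
# Keep the top few candidates per network node, ranked by score;
# 'cluster_index' and 'FusionScore' are assumed column names.
import pandas as pd

cand = pd.read_csv('candidates.tsv', sep='\t')
top3 = (cand.sort_values('FusionScore', ascending=False)
            .groupby('cluster_index', sort=False)
            .head(3))  # shortlist for manual review
```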

DEREPLICATOR+ uses p-values, so you can tell when a specific candidate is reliable: https://ccms-ucsd.github.io/GNPSDocumentation/dereplicator/

For NAP's validation we observed the following order of confidence: Fusion > Consensus > MetFrag.

So the tool is most useful when you have a spectral library match and a neighbor in the network has a candidate of a similar chemical class. We do not advise using all the predictions blindly. At the dataset level you may get a fair chemical class prediction; then, if there are a few candidates that you want to look at more deeply, browsing the NAP predictions in the network neighborhood may be helpful.

Cheers, Ricardo
