irinakoester opened this issue 5 years ago
The current LDA implementation runs out of memory when there are too many features (and/or motifs and mass fragments/losses; the three together determine the total memory needed).
Chunking the feature set is a workaround for now.
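For illustration, a minimal sketch of what such chunking could look like, assuming a tab-separated feature table; the file name, chunk size, and helper function are hypothetical and not part of the GNPS workflow:

```python
# A minimal sketch of the chunking workaround mentioned above: split the
# feature table into batches small enough for one LDA run. The file name,
# chunk size, and helper are hypothetical, not part of the GNPS workflow.
import pandas as pd

def chunk_features(features: pd.DataFrame, chunk_size: int = 5000):
    """Yield successive row slices of the feature table."""
    for start in range(0, len(features), chunk_size):
        yield features.iloc[start:start + chunk_size]

features = pd.read_csv("feature_table.tsv", sep="\t")  # placeholder input
for i, chunk in enumerate(chunk_features(features)):
    # write each chunk out so it can be submitted as its own LDA job
    chunk.to_csv(f"feature_chunk_{i}.tsv", sep="\t", index=False)
```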
Thanks @justinjjvanderhooft for your quick reply! Does it work for big data sets when using the code directly instead of the GNPS workflow?
Well, the restrictions are in the algorithm. GNPS is already quite generous with RAM! We will be looking at alternatives (hope to get the funding!), and another challenge will be the analysis of the results: how do you interpret networks of 50,000+ nodes?
Can you provide the job that fails?
https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=0e119ddb9a4a4f65853f0eb048a5f663
I don't think this is a dataset size issue. @madeleineernst I think this is an issue with getting data from a GNPS task; can you take a quick look at it?
@irinakoester, it seems you didn't specify a NAP, Dereplicator, VarQuest, or MS2LDA job ID (or at least none shows up when I clone the job)? At least one input needs to be specified.
In theory, it can use only library matches (and will take them into account), but it only starts to work properly once you add SMILES from candidate structures from NAP etc. Could you try again with a NAP task ID?
It should work without any of these options. I ran it with another data set using only the library IDs (analogs) and it worked. But I understand that it would work better with more information; that's why I want to use CSI:FingerID.
@mwang87, I created PR #248 to fix the issue. Some GNPS library files have an additional '\t' separator, which causes an error while parsing. Library hits with the additional '\t' will be ignored for now. It would be nice to eventually find out why this happens, so we don't lose too much data downstream.
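For context, a hedged sketch of the kind of tolerant parsing described above; this is not the actual PR #248 code, and the function name and skip behavior are illustrative:

```python
# Illustrative only (not the actual PR #248 change): skip library hits whose
# extra '\t' yields more fields than the header declares.
import csv

def parse_library_hits(path):
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        for row in reader:
            if len(row) != len(header):
                # extra '\t' separator: field count mismatch, ignore the hit
                continue
            yield dict(zip(header, row))
```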
Deployed for debugging and testing here:
https://proteomics3.ucsd.edu/ProteoSAFe/status.jsp?task=99c5cc5f71dc4a6daedc6d832d5cb55e
Seems to finish successfully; going to add it to the tests. @irinakoester, can you confirm it is working via this link:
https://proteomics3.ucsd.edu/ProteoSAFe/status.jsp?task=99c5cc5f71dc4a6daedc6d832d5cb55e
It worked for a network with 9,000 features but fails for a network with 20,000 features.