irinakoester opened this issue 5 years ago
The current LDA implementation runs out of memory when there are too many features (and/or motifs and mass fragments/losses; the three together determine the total memory needed).
Chunking the feature set is a workaround for now.
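For illustration, a minimal sketch of what such chunking could look like, assuming a tab-separated feature table; the file name, chunk size, and helper function are hypothetical and not part of the GNPS workflow:

```python
# A minimal sketch of the chunking workaround mentioned above: split the
# feature table into batches small enough for one LDA run. The file name,
# chunk size, and helper are hypothetical, not part of the GNPS workflow.
import pandas as pd

def chunk_features(features: pd.DataFrame, chunk_size: int = 5000):
    """Yield successive row slices of the feature table."""
    for start in range(0, len(features), chunk_size):
        yield features.iloc[start:start + chunk_size]

features = pd.read_csv("feature_table.tsv", sep="\t")  # placeholder input
for i, chunk in enumerate(chunk_features(features)):
    # write each chunk out so it can be submitted as its own LDA job
    chunk.to_csv(f"feature_chunk_{i}.tsv", sep="\t", index=False)
```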
Thanks @justinjjvanderhooft for your quick reply! Does it work for big data sets when using the code directly instead of the GNPS workflow?
Well, the restrictions are in the algorithm. GNPS is already quite generous with RAM! We will be looking at alternatives (hope to get the funding!), and another challenge will be the analysis of the results: how do you interpret networks of 50,000+ nodes?
Can you provide the job that fails?
https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=0e119ddb9a4a4f65853f0eb048a5f663
I don't think this is a dataset size issue. @madeleineernst I think this is an issue with getting data from a GNPS task; can you take a quick look at it?
@irinakoester, it seems you didn't specify a NAP, Dereplicator, VarQuest, or MS2LDA job ID (or at least none shows up when I clone the job)? At least one input needs to be specified.
In theory, it can use only library matches (and will take them into account), but it only starts to work properly once you add SMILES from candidate structures from NAP etc. Could you try again with a NAP task ID?
It should work without any of these options. I ran it with another data set using only the library IDs (analogs) and it worked. But I understand that it would work better with more information; that's why I want to use CSI:FingerID.
@mwang87, I created PR #248 to fix the issue. Some GNPS library files have an additional '\t' separator, which causes an error while parsing. Library hits with the additional '\t' will be ignored for now. It would be nice to eventually find out why this happens, so we don't lose too much data downstream.
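For context, a hedged sketch of the kind of tolerant parsing described above; this is not the actual PR #248 code, and the function name and skip behavior are illustrative:

```python
# Illustrative only (not the actual PR #248 change): skip library hits whose
# extra '\t' yields more fields than the header declares.
import csv

def parse_library_hits(path):
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        for row in reader:
            if len(row) != len(header):
                # extra '\t' separator: field count mismatch, ignore the hit
                continue
            yield dict(zip(header, row))
```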
Deployed for debugging and testing here:
https://proteomics3.ucsd.edu/ProteoSAFe/status.jsp?task=99c5cc5f71dc4a6daedc6d832d5cb55e
Seems to finish successfully; going to add it to the tests. @irinakoester, can you confirm it is working via this link:
https://proteomics3.ucsd.edu/ProteoSAFe/status.jsp?task=99c5cc5f71dc4a6daedc6d832d5cb55e
It worked for a network with 9,000 features but fails for a network with 20,000 features.