Roestlab / massdash

MassDash: A web-based dashboard for streamlined DIA-MS visualization, analysis, prototyping, and optimization
https://massdash.streamlit.app/
BSD 3-Clause "New" or "Revised" License

Fix/speedup sql #49

Closed jcharkow closed 7 months ago

jcharkow commented 8 months ago

Create hashtables to avoid the large number of SQL queries needed previously. I created three hashtables: run, feature, and transition.

On large files it takes ~1 minute to create all of the hashtables, but access after that is much quicker (down from 20 seconds to less than a second).

I did not run formal benchmarks; the difference in timings was large enough that it did not seem necessary.
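A minimal sketch of the caching pattern described above (not the PR's actual code; the table and column names here are made up for illustration): run the SQL once at initialization, store the rows in a dict keyed by (peptide, charge), and answer later lookups from memory instead of re-querying.

```python
# Sketch: cache query results in a dict keyed by (peptide, charge)
# so repeated lookups avoid re-running SQL against the file.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FEATURE (PEPTIDE TEXT, CHARGE INT, RT REAL)")
conn.executemany("INSERT INTO FEATURE VALUES (?, ?, ?)",
                 [("PEPTIDE", 2, 45.1), ("APKTK", 3, 12.7)])

# One pass over the table builds the hashtable; later lookups are O(1).
feature_hash = {}
for pep, charge, rt in conn.execute("SELECT PEPTIDE, CHARGE, RT FROM FEATURE"):
    feature_hash[(pep, charge)] = rt

print(feature_hash[("PEPTIDE", 2)])  # 45.1
```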

singjc commented 8 months ago

Looks good, just need to add some docstrings.

So at initialization, the three hashtables for run, feature, and transition are created to reindex and pull out information based on peptide and charge. For the run and feature tables it's probably okay on memory, since we only select a few columns, but the peptide_transition hashtable might take up more memory? We may have to assess the memory performance.

I was initially thinking of a slightly different approach: creating a hashtable of just the IDs and the necessary search columns, i.e.

peptide_precursor_hash

precursor_id, transition_id, modified_sequence, charge
0, 0, "PEPTIDE", 2
0, 1, "PEPTIDE", 2
0, 2, "PEPTIDE", 2
0, 3, "PEPTIDE", 2
0, 4, "PEPTIDE", 2
0, 5, "PEPTIDE", 2
...
1000, 1000, "APKTK", 3
1000, 1001, "APKTK", 3
1000, 1002, "APKTK", 3
1000, 1003, "APKTK", 3
1000, 1004, "APKTK", 3
1000, 1005, "APKTK", 3
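The table above could be built once as a pandas DataFrame indexed on (modified_sequence, charge); this is a hypothetical sketch of that idea (the column names follow the example rows above, not any actual MassDash code):

```python
# Sketch: a lookup table of ID + search columns, indexed so that a
# (sequence, charge) pair maps straight to the matching ID rows.
import pandas as pd

rows = [
    (0, 0, "PEPTIDE", 2), (0, 1, "PEPTIDE", 2),
    (1000, 1000, "APKTK", 3), (1000, 1001, "APKTK", 3),
]
peptide_precursor_hash = pd.DataFrame(
    rows, columns=["PRECURSOR_ID", "TRANSITION_ID", "MODIFIED_SEQUENCE", "CHARGE"]
).set_index(["MODIFIED_SEQUENCE", "CHARGE"])

# A (sequence, charge) lookup returns the ID rows without touching SQL.
ids = peptide_precursor_hash.loc[("PEPTIDE", 2)]
print(ids["PRECURSOR_ID"].unique())  # [0]
```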

Then, if we want to get data for PEPTIDE at charge 2, we use the hashtable to look up this precursor, extract the precursor_id and transition_ids, and use those IDs in the SQL query:

def getPeptideTransitionInfoShort(self, fullpeptidename, charge):
    # Look up the precursor and transition IDs in the in-memory hashtable,
    # then filter the SQL query on those indexed ID columns.
    hash_rows = peptide_precursor_hash.loc[(fullpeptidename, charge)]
    prec_id = hash_rows['PRECURSOR_ID'].iloc[0]
    # Format the transition IDs as a tuple so they render as "(0, 1, ...)"
    # in the IN clause below.
    tr_id = tuple(hash_rows['TRANSITION_ID'])

    stmt = f"""SELECT
                PEPTIDE.ID AS PEPTIDE_ID,
                PRECURSOR.ID AS PRECURSOR_ID,
                TRANSITION.ID AS TRANSITION_ID,
                PEPTIDE.UNMODIFIED_SEQUENCE,
                PEPTIDE.MODIFIED_SEQUENCE,
                PRECURSOR.PRECURSOR_MZ,
                PRECURSOR.CHARGE AS PRECURSOR_CHARGE,
                PRECURSOR.LIBRARY_INTENSITY AS PRECURSOR_LIBRARY_INTENSITY,
                PRECURSOR.LIBRARY_RT AS PRECURSOR_LIBRARY_RT,
                {prec_lib_drift_time_query}
                TRANSITION.PRODUCT_MZ,
                TRANSITION.CHARGE AS PRODUCT_CHARGE,
                TRANSITION.TYPE AS PRODUCT_TYPE,
                TRANSITION.ORDINAL AS PRODUCT_ORDINAL,
                TRANSITION.ANNOTATION AS PRODUCT_ANNOTATION,
                TRANSITION.LIBRARY_INTENSITY AS PRODUCT_LIBRARY_INTENSITY,
                TRANSITION.DETECTING AS PRODUCT_DETECTING,
                PEPTIDE.DECOY AS PEPTIDE_DECOY,
                PRECURSOR.DECOY AS PRECURSOR_DECOY,
                TRANSITION.DECOY AS TRANSITION_DECOY
                FROM (SELECT * FROM PRECURSOR WHERE PRECURSOR.ID = {prec_id}) AS PRECURSOR
                INNER JOIN PRECURSOR_PEPTIDE_MAPPING ON PRECURSOR_PEPTIDE_MAPPING.PRECURSOR_ID = PRECURSOR.ID
                INNER JOIN PEPTIDE ON PEPTIDE.ID = PRECURSOR_PEPTIDE_MAPPING.PEPTIDE_ID
                INNER JOIN (SELECT * FROM TRANSITION_PRECURSOR_MAPPING WHERE TRANSITION_ID IN {tr_id}) AS TRANSITION_PRECURSOR_MAPPING ON TRANSITION_PRECURSOR_MAPPING.PRECURSOR_ID = PRECURSOR.ID
                INNER JOIN TRANSITION ON TRANSITION.ID = TRANSITION_PRECURSOR_MAPPING.TRANSITION_ID"""

    data = pd.read_sql(stmt, self.conn)

    return data

We actually probably wouldn't even need the tr_id, just the prec_id.

In general, filtering on an indexed column (i.e. WHERE INDEXED_COL = value) is more efficient than filtering on a non-indexed column (i.e. WHERE NON_INDEXED_COL = value).

For example:

Filtering on an Indexed Column (ID):

SELECT * FROM PEPTIDE WHERE PEPTIDE.ID = 0;

If the ID column is indexed, the database engine can use the index to quickly locate the row with ID = 0.

Filtering on a Non-Indexed Column (SEQUENCE):

SELECT * FROM PEPTIDE WHERE PEPTIDE.SEQUENCE = "PEPTIDE";

If the SEQUENCE column is not indexed, the database engine has to scan the entire PEPTIDE table to find rows where the SEQUENCE is "PEPTIDE".
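In SQLite you can verify this difference directly: EXPLAIN QUERY PLAN reports whether a filter does a full scan or an index search, and CREATE INDEX changes the answer. A small sketch (the table and index names here are illustrative):

```python
# Sketch: show that filtering on SEQUENCE is a full table scan until
# an index is created, at which point the plan becomes an index search.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PEPTIDE (ID INTEGER PRIMARY KEY, SEQUENCE TEXT)")
conn.execute("INSERT INTO PEPTIDE VALUES (0, 'PEPTIDE')")

query = "SELECT * FROM PEPTIDE WHERE SEQUENCE = 'PEPTIDE'"

# No index on SEQUENCE yet: the plan reports a scan of the whole table.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]

conn.execute("CREATE INDEX idx_peptide_sequence ON PEPTIDE (SEQUENCE)")

# With the index, the same query becomes an index search.
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]

print(plan_before)  # e.g. "SCAN PEPTIDE"
print(plan_after)   # e.g. "SEARCH PEPTIDE USING INDEX idx_peptide_sequence ..."
```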

I'm not sure if this approach will really make a major difference or not. I think what you implemented is probably efficient enough, just wanted to throw out my approach as well.

Actually, never mind: going through it, your implementation seems similar and is probably more efficient, since you don't perform a second query. I just want to keep the comments above for looking back on later.

jcharkow commented 8 months ago

Good point, I will take a look at memory performance. Do you have any functions that you use for memory performance already in the repository? I was probably just going to look at how much memory the pandas df uses. I think your approach of directly querying the indexed column will be slower but also a lot more memory efficient. Memory efficiency might become more important with a lot of big runs.
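For the "how much memory the pandas df uses" check, pandas can report a DataFrame's footprint directly via memory_usage; a quick sketch (the columns here are just illustrative):

```python
# Sketch: measure a DataFrame's in-memory size before adding any
# heavier profiling machinery.
import pandas as pd

df = pd.DataFrame({
    "PRECURSOR_ID": range(1000),
    "MODIFIED_SEQUENCE": ["PEPTIDE"] * 1000,
    "CHARGE": [2] * 1000,
})

# deep=True counts the actual Python string objects, not just the
# 8-byte pointers stored in the object column.
total_bytes = df.memory_usage(deep=True).sum()
print(f"{total_bytes / 1024:.1f} KiB")
```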

singjc commented 8 months ago

> Good point, I will take a look at memory performance. Do you have any functions that you use for memory performance already in the repository? I was probably just going to look at how much memory the pandas df uses. I think your approach of directly querying the indexed column will be slower but also a lot more memory efficient. Memory efficiency might become more important with a lot of big runs.

I don't currently have any memory performance tracking implemented. I think you can create a decorator or context manager to measure a method's or block of code's memory usage. There is probably a library or method to do this.
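One standard-library option for the decorator idea is tracemalloc; a sketch of what such a decorator could look like (the names track_memory and build_table are made up here, not from the repository):

```python
# Sketch: a decorator that reports a function's peak memory allocation
# using the standard-library tracemalloc module.
import functools
import tracemalloc

def track_memory(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            result = func(*args, **kwargs)
        finally:
            # Peak traced allocation since start(), in bytes.
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
        print(f"{func.__name__}: peak {peak / 1024:.1f} KiB")
        return result
    return wrapper

@track_memory
def build_table():
    return list(range(100_000))

table = build_table()
```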

jcharkow commented 8 months ago

@singjc I've updated the structure to use less memory. Less information is cached, resulting in faster initialization times. Because the query filters on indexed columns, it is still fairly quick. Please let me know what you think.

jcharkow commented 7 months ago

@singjc is this ready for merging? I will clean up the documentation later in a different branch; this just has some test fixes that I think are important to have in before merging.

singjc commented 7 months ago

> @singjc is this ready for merging? I will clean up the documentation later in a different branch; this just has some test fixes that I think are important to have in before merging.

Yip, I think this is good to merge. Will merge.