Preprocessing of Uniprot data - Githubissues

MannLabs / alphamap

An open-source Python package for the visual annotation of proteomics data with sequence specific knowledge.

https://mannlabs.github.io/alphamap/

Apache License 2.0


Preprocessing of Uniprot data #1

Closed EugeniaVoytik closed 3 years ago

EugeniaVoytik commented 3 years ago

This merge contains a notebook with functions (also saved separately in the uniprot_integration.py file) that preprocess data downloaded from UniProt and save the result as a pandas DataFrame or a .csv file.

The preprocessed Uniprot data include:

the known preprocessing events for proteins, such as signal peptide, transit peptide, propeptide, chain, peptide;
information on all post-translational modifications available in UniProt, such as modified residues (phosphorylation, methylation, acetylation, etc.), lipidation, glycosylation, etc.;
information on sequence similarities with other proteins and the domain(s) present in a protein, such as domain, repeat, region, motif, etc.;
information on the secondary and tertiary structure of proteins, such as turn, beta-strand, helix.

The output data frame / .csv file contains information about:

protein_id(str)
feature(category)
isoform_id(str)
start(int)
end(int)
note information(str)
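
As a rough illustration of the layout listed above, the following sketch builds a tiny frame with those columns and writes it out as CSV text. The row values are made up, and the column name "note" is an assumption standing in for the "note information" field; this is not output of the actual preprocessing functions.

```python
import io

import pandas as pd

# Column names follow the list above; "note" is an assumed name.
columns = ["protein_id", "feature", "isoform_id", "start", "end", "note"]

# Made-up rows, for illustration only.
rows = [
    ("P11532", "CHAIN", "", 1, 100, "made-up note"),
    ("P11532", "DOMAIN", "P11532-2", 15, 60, "made-up note"),
]

uniprot_df = pd.DataFrame(rows, columns=columns)
# Store the feature type as a categorical column, as listed above.
uniprot_df["feature"] = uniprot_df["feature"].astype("category")

# Write the frame as CSV into an in-memory buffer instead of a file.
buffer = io.StringIO()
uniprot_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```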

All functions are tested inside a notebook.

ibludau commented 3 years ago

I tried to run your notebook which looks really nice, but the tests for processing '../testdata/P11532_test_file.txt' fail. I only get 96 rows instead of 167. I could not directly spot where this happens, so we might need some more tests here. Could you maybe check this again and let me know if this works for you based on the current git status?

EugeniaVoytik commented 3 years ago

I've just checked again and everything looks fine for me. I've pushed the output after preprocessing of the P11532_test_file into the testdata folder.

ibludau commented 3 years ago

I figured out that the problem is the conversion of data types:

uniprot_df.end = uniprot_df.end.astype('Int64')

For me, 'NaN' gets changed to a strange '<NA>' that does not work in combination with the logical filter:

uniprot_df[(uniprot_df.start != -1) & (uniprot_df.end != -1)]

We can of course change the filter, but I don't think that's a good solution. We'd also like to have a logical NA for downstream processing. My suggestion would be to stick with floats for the column type. I don't think this should be too bad. If you have another suggestion, that's also fine.

I guess we might have different Python versions, and that's why it's not working the same for both of us. But we should have a stable version, so I'd suggest adapting it. What do you think?
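
The difference described above can be reproduced with a small made-up frame. This is a minimal sketch, assuming a recent pandas version (as far as I know, from 1.0.2 onward a missing entry in a nullable boolean mask is treated as False when indexing); the helper name `drop_unknown_positions` and the data are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up positions: the second row has a missing end, the third uses -1.
df_float = pd.DataFrame({
    "start": [1.0, 5.0, -1.0],
    "end": [10.0, np.nan, -1.0],
})
# Same data with the nullable integer dtype; NaN becomes <NA>.
df_int = df_float.astype("Int64")


def drop_unknown_positions(df):
    """Apply the filter from the comment above (hypothetical helper name)."""
    return df[(df["start"] != -1) & (df["end"] != -1)]


# float: NaN != -1 evaluates to True, so the row with the missing end is kept.
kept_float = drop_unknown_positions(df_float)
# Int64: <NA> != -1 evaluates to <NA>, and that row is dropped by the mask.
kept_int = drop_unknown_positions(df_int)
```

With float columns the row with the missing end survives the filter, while with 'Int64' columns it is silently dropped, which is consistent with getting fewer rows than expected.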

EugeniaVoytik commented 3 years ago

Ah, ok, that's weird. I didn't expect that this could be a problem. Yes, of course, it's better to leave it as a float dtype. I'll update it soon. I think the problem is the version of pandas. Out of curiosity, which one do you have?

ibludau commented 3 years ago

Ok great - I have pandas 1.1.3 and you? Maybe I need to update

EugeniaVoytik commented 3 years ago

Ok, now I understand. I have 0.24.2, which was the last stable version supported by HoloViz. Updating pandas is constrained by HoloViz, which requires pandas[version='<=0.24.2']. Therefore, yes, it's better to just leave it as float and have no problems with dtypes. Updated.

ibludau commented 3 years ago

Oh ok - well then let's hope sticking with float now works across versions

ibludau commented 3 years ago

I made a small adjustment to the structure of the notebook. It's important that all notebooks in the 'nbs' folder run through without any errors on all platforms and for all users. I therefore suggest having a separate notebook in the main directory that performs the actual processing for demonstration purposes. This is also how we do it in AlphaPept.

ibludau commented 3 years ago

I would further suggest thinking about the testing functions again. I realised that you basically run "test_df = preprocess_uniprot(path_to_test_file)" in every test function for "preprocess_uniprot". Although this does not take a lot of time, it is not very elegant, and I would still suggest that we perform one test per main function and simply provide more explicit error messages. Another option would be to generate "test_df" as a global variable and then use it in each function, but I don't particularly like this solution. Maybe we can discuss this again.
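
One way to read the "one test per main function with explicit error messages" suggestion is sketched below. It runs the preprocessing once and checks several properties of the result, each with its own message. The names `preprocess_stub` and `check_preprocessing`, the column name "note", and the rows are all hypothetical; the stub only stands in for preprocess_uniprot so that the sketch is self-contained.

```python
import pandas as pd

# Assumed column names; "note" stands in for the "note information" field.
EXPECTED_COLUMNS = ["protein_id", "feature", "isoform_id", "start", "end", "note"]


def preprocess_stub(path):
    """Hypothetical stand-in for preprocess_uniprot; ignores `path`."""
    return pd.DataFrame(
        [
            ("P11532", "CHAIN", "", 1.0, 100.0, "made-up"),
            ("P11532", "DOMAIN", "", 15.0, 60.0, "made-up"),
        ],
        columns=EXPECTED_COLUMNS,
    )


def check_preprocessing(preprocess, path, expected_rows):
    """Run the preprocessing once and check its output step by step."""
    df = preprocess(path)
    assert list(df.columns) == EXPECTED_COLUMNS, (
        f"unexpected columns: {list(df.columns)}"
    )
    assert len(df) == expected_rows, (
        f"expected {expected_rows} rows, got {len(df)}"
    )
    for col in ("start", "end"):
        assert pd.api.types.is_float_dtype(df[col]), (
            f"column '{col}' should be float, got {df[col].dtype}"
        )
    return df


checked_df = check_preprocessing(preprocess_stub, "unused_path.txt", expected_rows=2)
```

A failing check then reports which property was violated, for example the expected and actual row counts, without re-running the preprocessing in several separate test functions.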