When uploading DIA-NN report.tsv: param\parameterized.py:337: DtypeWarning: Columns (2) have mixed types

MannLabs / alphamap

An open-source Python package for the visual annotation of proteomics data with sequence specific knowledge.

https://mannlabs.github.io/alphamap/

Apache License 2.0

When uploading DIA-NN report.tsv: param\parameterized.py:337: DtypeWarning: Columns (2) have mixed types #22

Closed kaisengit closed 3 years ago

kaisengit commented 3 years ago

Describe the bug When uploading a DIA-NN report.tsv file the following error is displayed in the terminal:

param\parameterized.py:337: DtypeWarning: Columns (2) have mixed types. Specify dtype option on import or set low_memory=False.

The error shown in the browser is the following:

The columns necessary for further analysis cannot be extracted from the first experimental file. Please check the data uploading instructions for a particular software tool.

In my own testing with DIA-NN output (version 1.8), reading the file in pandas with low_memory=False fixes the problem.
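For reference, a minimal sketch of the two options the warning text itself suggests, using pandas directly. The tiny in-memory table is a made-up stand-in for a real report.tsv, and this is not necessarily how AlphaMap reads the file internally.

```python
import io

import pandas as pd

# Made-up stand-in for a DIA-NN report.tsv; real reports have many more columns.
tsv = (
    "Run\tProtein.Ids\tModified.Sequence\n"
    "run1\tP01308\tPEPTIDEK\n"
    "run2\tP01308\tPEPTIDER\n"
)

# Option 1: process the file in one pass so each column gets a single inferred type.
report = pd.read_csv(io.StringIO(tsv), sep="\t", low_memory=False)

# Option 2: declare the type of the affected column up front.
report_typed = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"Protein.Ids": str})
```

Either option only silences the DtypeWarning; it does not change which columns are present in the file.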

To Reproduce Steps to reproduce the behavior:

1. Enter the path to a DIA-NN output file
2. See that the sample names are parsed correctly
3. Click Upload
4. See error

Expected behavior No error

Desktop (please complete the following information):

Installation Type: Windows Installer
OS: Windows 10
Version 0.0.8

EugeniaVoytik commented 3 years ago

Hello,

Thanks a lot for the submission of the bug!

What you see in the terminal is just a warning and does not affect how the tool works. The error shown in the GUI means that the program can't extract the columns that are needed for further analysis and visualization. All columns that must be present in the DIA-NN output are listed in the GUI and in the manual (Protein.Ids, Modified.Sequence, Run). You may also check our test file for DIA-NN.
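A quick way to check this outside the GUI is sketched below. The column names are the three listed above; the helper `missing_columns` is a hypothetical name for illustration and not part of AlphaMap.

```python
import pandas as pd

# The three columns named in the GUI and the manual for DIA-NN input.
REQUIRED_DIANN_COLUMNS = ("Protein.Ids", "Modified.Sequence", "Run")


def missing_columns(df, required=REQUIRED_DIANN_COLUMNS):
    """Return the required column names that are absent from df, in the given order."""
    return [name for name in required if name not in df.columns]


# Made-up example table that lacks one of the required columns.
example = pd.DataFrame({"Run": ["run1"], "Protein.Ids": ["P01308"]})
print(missing_columns(example))  # ['Modified.Sequence']
```

An empty list means all three columns are present, in which case the problem lies in the column contents rather than the column names.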

If you have all above-mentioned columns in the report.tsv file and still get the same error message, I would be glad if you could share with us the whole file or even part of it (several rows together with the names of the columns).

If you still have any other questions, I'm happy to help! 😀

Best,
Jane

kaisengit commented 3 years ago

Hi there and thanks for the help! My report.tsv definitely contains the columns you mention. In fact, it contains even more columns than your DIA-NN test data file:

{'First.Protein.Description',
 'Lib.PG.Q.Value',
 'Lib.PTM.Site.Confidence',
 'Ms1.Translated',
 'PEP',
 'PTM.Informative',
 'PTM.Localising',
 'PTM.Q.Value',
 'PTM.Site.Confidence',
 'PTM.Specific'}

These columns are found inside my report.tsv but not in your file. Maybe that is where the problem is coming from? It could be that DIA-NN 1.8 introduced some new columns, I guess. I'll also see if I can prepare a dummy .tsv file to showcase the problem.
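A comparison like the one above can be done by reading only the header row of each file. This is a sketch with made-up in-memory headers; `header_columns` is a hypothetical helper, and with real files you would pass the two file paths instead.

```python
import io

import pandas as pd


def header_columns(source):
    """Read only the header row of a tab-separated file and return its column names."""
    return set(pd.read_csv(source, sep="\t", nrows=0).columns)


# Made-up headers standing in for the user's report and the reference test file.
mine = io.StringIO("Run\tProtein.Ids\tModified.Sequence\tPEP\tPTM.Q.Value\n")
reference = io.StringIO("Run\tProtein.Ids\tModified.Sequence\n")

mine_cols = header_columns(mine)
reference_cols = header_columns(reference)

extra = sorted(mine_cols - reference_cols)    # columns only in the user's report
missing = sorted(reference_cols - mine_cols)  # columns only in the reference file
print(extra)    # ['PEP', 'PTM.Q.Value']
print(missing)  # []
```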

EugeniaVoytik commented 3 years ago

Thanks a lot for your answer! No, new columns shouldn't be a problem. I would be really thankful for a small example file to reproduce the bug that will help us to solve it. 🙏

kaisengit commented 3 years ago

I have created a minimal example which shows the problem: minimal_report.zip

EugeniaVoytik commented 3 years ago

Thanks a lot for sending your file! I've checked it, and the issue is that the Protein.Ids column needs to contain the unique Uniprot entry identifier, e.g. P01308 for the human insulin protein. As I mentioned before, you may always look at the example file. Unique Protein.Ids are needed to combine the user's experimental data in AlphaMap with the previously mined Uniprot data. We'll add more detailed information about the content of these columns to the instructions for the supported proteomics software tools to prevent confusion.
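To spot this kind of problem before uploading, one could flag rows whose Protein.Ids hold no Uniprot-style accession. The sketch below uses the accession pattern published in the UniProt help pages; the optional isoform suffix (e.g. "-2"), the ";" separator, and the helper name `has_uniprot_accession` are assumptions for illustration, and AlphaMap's own matching may differ.

```python
import re

import pandas as pd

# Accession pattern from the UniProt help pages; the trailing isoform
# suffix such as "-2" is an assumption added for this sketch.
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
    r"(?:-\d+)?$"
)


def has_uniprot_accession(protein_ids, sep=";"):
    """True if at least one entry in a separated ID string looks like a Uniprot accession."""
    return any(
        UNIPROT_ACCESSION.match(token.strip())
        for token in str(protein_ids).split(sep)
    )


# Made-up rows: a valid ID, two valid IDs, and a custom-FASTA style name.
report = pd.DataFrame(
    {"Protein.Ids": ["P01308", "P01308;A0A024R161", "my_custom_protein_1"]}
)
report["has_uniprot_id"] = report["Protein.Ids"].map(has_uniprot_accession)
print(report["has_uniprot_id"].tolist())  # [True, True, False]
```

The pattern only checks the format of an accession, not whether the entry actually exists in Uniprot.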

vdemichev commented 3 years ago

In connection with this case, where Protein.Ids does not contain valid IDs: is there any specific rationale for using Protein.Ids and not Protein.Group? Protein.Ids normally contains all mapped proteins, while Protein.Group contains the proteins inferred with a maximum parsimony algorithm.

ibludau commented 3 years ago

Our initial idea for AlphaMap was to visualize proteomics data on the peptide level without making any assumptions about protein inference. Protein grouping strategies vary between software tools, and if individual peptides are of interest, you might miss important information on possible parental proteins/genes (btw. the hover info shows all protein ids that a peptide can be mapped to).

That being said, I understand that it might be misleading for the more 'protein focussed' people to have a peptide shown for proteins that are not included in the assigned protein group. We will think about this again - maybe we can introduce a parameter for users to choose whether to show a peptide for all 'Protein.Ids' or only the 'Protein.Group' accessions.
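Purely as an illustration of what such a parameter could look like, here is a sketch of a helper that expands peptides to one row per accession from either column. The function name `accessions_per_peptide`, its `source` parameter, and the ";" separator are hypothetical; nothing like this is confirmed to exist in AlphaMap.

```python
import pandas as pd


def accessions_per_peptide(df, source="Protein.Ids", sep=";"):
    """Return one row per (peptide, accession) pair, taken from the chosen column."""
    if source not in ("Protein.Ids", "Protein.Group"):
        raise ValueError("source must be 'Protein.Ids' or 'Protein.Group'")
    # Split the separated accession string and expand it to one row per accession.
    exploded = df.assign(accession=df[source].str.split(sep)).explode("accession")
    return exploded[["Modified.Sequence", "accession"]].reset_index(drop=True)


# Made-up row: the peptide maps to two proteins, but the group holds only one.
report = pd.DataFrame(
    {
        "Modified.Sequence": ["PEPTIDEK"],
        "Protein.Ids": ["P01308;P01315"],
        "Protein.Group": ["P01308"],
    }
)

all_mapped = accessions_per_peptide(report, source="Protein.Ids")
group_only = accessions_per_peptide(report, source="Protein.Group")
```

With `source="Protein.Ids"` the peptide would be shown for every mapped protein, while `source="Protein.Group"` restricts it to the inferred group.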

vdemichev commented 3 years ago

Oh, makes sense, thank you for the explanation!

kaisengit commented 3 years ago

That makes sense, thank you for having a look. I used a custom FASTA for my search, and I guess this is the reason why the Uniprot IDs were not included in the report. I'll close my bug report then.