IGS / gEAR

The gEAR Portal was created as a data archive and viewer for gene expression data including microarrays, bulk RNA-Seq, single-cell RNA-Seq and more.
https://umgear.org
GNU Affero General Public License v3.0

individual projection error #847

Closed carlocolantuoni closed 3 months ago

carlocolantuoni commented 3 months ago

when i run this projection: https://nemoanalytics.org/projection.html?projection_algorithm=pca&multipattern_plots=0&projection_source=f61159d5&layout_id=f1b93141&projection_patterns=FC

for this dataset: "Expression data from reactive astrocytes acutely purified from young adult mouse brains"

i am getting this error: "Could not create projection AnnData object from CSV."

can you figure out why?

adkinsrs commented 3 months ago

Found an error with a sample name, 1dayaftersham: "Could not convert string to float". Looking at the first sample line, "GSE35338_Biomat_20___BioAssayId=165640Name=Astrocyte,1dayaftersham,biologicalrep4",253.03062, it seems to me that the scanpy read_csv function is ignoring the quotes around the sample name and splitting the name at its commas. Scanpy's function is set to treat the second column onwards as floats (since it should be data), so it fails when it hits "1dayaftersham".
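
To illustrate the behavior described above, here is a minimal sketch using only Python's standard csv module. It is not scanpy's or AnnData's actual parsing code; it only compares a naive split on every comma with a quote-aware parse of a row modeled on the sample line. The names `naive_fields`, `quoted_fields`, and `try_float` are made up for this example.

```python
import csv
import io

# One row modeled on the sample line quoted above: a quoted sample name
# that contains commas, followed by a single expression value.
row = '"GSE35338_Biomat_20___BioAssayId=165640Name=Astrocyte,1dayaftersham,biologicalrep4",253.03062\n'

# Naive splitting on every comma ignores the quotes and yields four fields.
naive_fields = row.strip().split(",")

# A quote-aware parser keeps the sample name intact and yields two fields.
quoted_fields = next(csv.reader(io.StringIO(row)))


def try_float(text):
    """Return float(text), or the ValueError message if conversion fails."""
    try:
        return float(text)
    except ValueError as err:
        return str(err)


# The second field should be numeric data; with the naive split it is not.
print(len(naive_fields), "fields (naive):", try_float(naive_fields[1]))
print(len(quoted_fields), "fields (quote-aware):", try_float(quoted_fields[1]))
```

With the naive split, the second field is the middle piece of the sample name, so the float conversion fails in the same way as the reported error.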

I feel like we have observed this before, but I need to see if it is in past emails perhaps. It may have been with using dots in the sample name.


There is also a "failed with error status 500" dataset towards the bottom... explanation below

Looks like the default curation being plotted there uses a saved analysis, but the "user_saved" analysis directory (and file) for this dataset is not on the filesystem. @jorvis, I'm not sure how you created this VM, but is there a chance some user_saved analyses did not transfer over?

@carlocolantuoni if you are pressed for time, maybe for this dataset you can create a new curation using the primary analysis default, and when you save, check the "make default display" option.

adkinsrs commented 3 months ago

I tested reading the CSV file in the Python REPL to see if I could reproduce the problem without all the extraneous code, and I got the same error. I'm going to create a ticket in the AnnData GitHub repo.

carlocolantuoni commented 3 months ago

thnx - i can make another view and set as default as u suggest while you work on it - thnx

carlocolantuoni commented 3 months ago

making the new view didn't seem to help.

btw - there is no prob in gene viewing, just projection - in the past i know i have seen successful plots of projection in this dataset, so as u say shaun, it might be related to what did and did not get transferred

adkinsrs commented 3 months ago

I think you did the wrong dataset. I was referring to the "Mouse (6 months), scRNA-seq, immune cells from whole brains of AD model (5xFAD) (Amit)" dataset that is missing the saved analysis file, not the first discussed dataset. The Amit dataset shows the same error on expression and projection views.

carlocolantuoni commented 3 months ago

o - ok.. when i try to curate a view for that one, there are no options in the metadata pulldown menus - only "expression"

adkinsrs commented 3 months ago

Oh ok, that means the current default curation was using metadata from the saved analysis.

Also, there is now a ticket in the "anndata" repo for the other dataset issue.

flying-sheep commented 3 months ago

Hi, since the CSV format isn’t a standard and rather “whatever the developer of whatever language or framework felt like at the time”, there are no CSV parsing/writing bugs, just choices.

If you want to be able to rely on reading a file you wrote, avoid CSV/TSV/… as an intermediary format.

If you know exactly what CSV settings you need to read a specific CSV file, use pandas’ read_csv function and the AnnData constructor to create an object from the expression/metadata parts of the data frame.

adkinsrs commented 3 months ago

https://github.com/scverse/anndata/issues/1573 will not be worked on. So two things need to happen.

  1. I need to write a workaround so that this CSV can be read. Since these sample names work in h5ad, what I will do is create a file that maps them to generic names, load the CSV with the generic names and save it as h5ad, then rename back to the original sample names. EDIT: I missed the reply above
  2. @jorvis I would recommend that the upload validator be modified to disallow commas in sample names, if possible. I believe we have seen issues with periods as well, but commas are what cause this parsing issue. EDIT: maybe not as important, if the solution above works.
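
A rough sketch of what the check in item 2 could look like is below. It is not the gEAR upload validator; `find_invalid_sample_names` and `DISALLOWED_CHARS` are hypothetical names, and only the comma is included by default because that is the only character confirmed to break parsing here.

```python
# Hypothetical set of characters to reject. Only the comma is confirmed
# as a problem in this thread; others (e.g. periods) could be added.
DISALLOWED_CHARS = frozenset({","})


def find_invalid_sample_names(sample_names, disallowed=DISALLOWED_CHARS):
    """Return (name, offending_characters) pairs for names that contain
    any disallowed character. Names without problems are omitted."""
    problems = []
    for name in sample_names:
        bad = sorted(ch for ch in disallowed if ch in name)
        if bad:
            problems.append((name, bad))
    return problems


names = [
    "Astrocyte,1dayaftersham,biologicalrep4",
    "Astrocyte_1dayaftersham_biologicalrep4",
]
for name, bad in find_invalid_sample_names(names):
    print(f"Rejecting {name!r}: contains {bad}")
```

Passing a larger set as `disallowed` would extend the same check to other characters if they turn out to cause trouble.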

adkinsrs commented 3 months ago

@flying-sheep, just saw your reply (I was en route to work so I missed it)... the original use case was to take a valid AnnData object and replace the data in adata.X and adata.var with the data from the CSV. My strategy was to use anndata.read_csv to populate X, and replace our "adata.var" contents with the columns from the CSV.

I'll try the pandas read_csv and Anndata constructor and see if this is a simpler solution than what I was proposing in my previous comment. Thanks for the info @flying-sheep!

adkinsrs commented 3 months ago

The previous commit resolved the issue @carlocolantuoni was having

Code here:

        import anndata
        import pandas as pd

        # Placeholder: the existing AnnData object whose X and var are being replaced
        dataset_adata = ...

        # Read the CSV with pandas (which respects quoted fields) to make X and var
        df = pd.read_csv(projection_csv_path, sep=',', index_col=0, header=0)
        X = df.to_numpy()
        var = pd.DataFrame(index=df.columns)
        # Reuse obs and obsm from the existing object
        obs = dataset_adata.obs
        obsm = dataset_adata.obsm
        # Create the anndata object and write to h5ad
        # Associate with a filename to ensure AnnData is read in "backed" mode
        projection_adata = anndata.AnnData(X=X, obs=obs, var=var, obsm=obsm, filename=projection_adata_path, filemode='r')

        # For some reason the gene_symbol is not taken in by the constructor
        projection_adata.var["gene_symbol"] = projection_adata.var_names

        # Use projection_adata downstream in place of dataset_adata

Once again, thanks for the suggestion and assistance @flying-sheep

flying-sheep commented 3 months ago

great, happy to be of assistance!

adkinsrs commented 3 months ago

Linking #623 for the second dataset where analysis is missing (presumably deleted)

carlocolantuoni commented 3 months ago

thnx guys!
