broadinstitute / cpg

Cell painting gallery
BSD 3-Clause "New" or "Revised" License

Some parsing issues #34

Open emiglietta opened 4 months ago

emiglietta commented 4 months ago

There appear to be some issues when parsing well_id, particularly in the embedding.parquet files from sources 1, 2 and 7 of the JUMP dataset. For these files, the well_id listed in the index is actually the previous "segment" of the key. I used this filter to check it:

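import polars as pl

# `index` is assumed to be a polars LazyFrame over the gallery index.
df = (
    index
    # Keep one row per distinct well_id.
    .unique(subset="well_id")
    # Keep only rows that are not flagged as parsing errors.
    .filter(pl.col("is_parsing_error").eq(False))
    .select("well_id", "key", "dataset_id", "source_id", "leaf_node")
    # Keep one example row per dataset / leaf node / source combination.
    .unique(subset=["dataset_id", "leaf_node", "source_id"])
    .collect(streaming=True)
)
df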
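
To make the "previous segment" pattern easier to spot, one option is to compute where in the key the parsed well_id actually sits. The following is only a minimal sketch: it assumes keys are "/"-separated, the helper name offset_from_end is hypothetical, and the example keys are invented for illustration rather than taken from the index.

```python
from typing import Optional


def offset_from_end(key: str, value: str) -> Optional[int]:
    """Return how far from the end of `key` the segment equal to `value` sits.

    0 means the last segment, 1 the one before it, and so on.
    Returns None when no segment matches.
    """
    segments = key.strip("/").split("/")
    for offset, segment in enumerate(reversed(segments)):
        if segment == value:
            return offset
    return None


# Invented (key, parsed well_id) pairs, for illustration only.
examples = [
    ("root/batch1/plate1/A01/embedding.parquet", "A01"),
    ("root/batch1/plate1/A01/embedding.parquet", "plate1"),
    ("root/batch1/plate1/A01/embedding.parquet", "B02"),
]

for key, well_id in examples:
    print(well_id, offset_from_end(key, well_id))
```

If this offset were computed for each row of df, sources whose well_id is taken from the wrong segment should show a different offset from the rest, and rows returning None would point to a well_id that does not appear in the key at all.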

Source 15 of the JUMP dataset seems to have a different key structure from the other sources, which leads to a number of parsing errors (which are not flagged as parsing errors in the is_parsing_error column of the index):