There appear to be some issues when parsing well_id, particularly in the embedding.parquet files from sources 1, 2 and 7 of the JUMP dataset. The well_id listed in the index corresponds to the previous "segment" of the key.
I used this filtering to check it:
Source 15 of the JUMP dataset seems to have a different key structure from the other sources, which leads to a number of parsing errors (none of which are flagged as parsing errors in the is_parsing_error column of the index):
The dataset_id is 'jump' and not 'cpg0016-jump'
There are 460 unique 'plate_id' values from that source, but only 183 of those follow the expected structure.
import polars as pl

# `index` is assumed to be the index, already loaded as a polars LazyFrame.
# Source 15 in JUMP lists 460 unique plate_ids: are there really 460 plates?
# The plate_id values also vary in structure.
df = (
    index
    .filter(pl.col("dataset_id").eq("jump"))
    .filter(pl.col("source_id").eq("source_15"))
    .unique(subset=["plate_id"])
    .select(pl.col("plate_id"))
    .collect(streaming=True)
)
df
# For source 15 in JUMP, are there really 460 plates?
# Only 183 unique plate_ids match the regex for the expected plate name structure
# ("PEP" or "PEC" followed by exactly 8 digits).
df = (
    index
    .filter(pl.col("dataset_id").eq("jump"))
    .filter(pl.col("source_id").eq("source_15"))
    .filter(pl.col("plate_id").str.contains(r"^PE(P|C)[0-9]{8}$"))
    .select(pl.col("plate_id").unique().sort())
    .collect(streaming=True)
)
df
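To see what the non-matching plate_id values actually look like, it may help to split the unique values into those that follow the expected structure and those that do not. The following is only a minimal sketch using Python's standard re module: the helper name split_plate_ids and the sample values are made up for illustration and are not taken from the index.

```python
import re

# Same pattern as in the filter above: "PEP" or "PEC" followed by exactly 8 digits.
PLATE_ID_PATTERN = re.compile(r"^PE(P|C)[0-9]{8}$")


def split_plate_ids(plate_ids):
    """Return (matching, non_matching) as sorted lists of unique plate_ids."""
    unique_ids = set(plate_ids)
    matching = sorted(p for p in unique_ids if PLATE_ID_PATTERN.match(p))
    non_matching = sorted(p for p in unique_ids if not PLATE_ID_PATTERN.match(p))
    return matching, non_matching


# Hypothetical values, invented for illustration only (not real index content).
sample = ["PEP00000001", "PEC00000002", "PEP00000001", "images", "PEP0000001"]
matching, non_matching = split_plate_ids(sample)
print(len(matching), "match;", len(non_matching), "do not:", non_matching)
```

With the collected frame from the first query, the input could be something like df["plate_id"].to_list(), assuming the column contains no null values. Looking at the non-matching list directly should make it easier to tell which segment of the key ended up in plate_id.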