Some parsing issues - Githubissues

There appear to be some issues when parsing well_id, particularly in the embedding.parquet files from sources 1, 2 and 7 of the JUMP dataset. For these files, the well_id listed in the index corresponds to the previous "segment" of the key rather than to the well itself. I used this filter to check:

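import polars as pl

# `index` is the polars LazyFrame holding the index.
df = (
    index
    .unique(subset="well_id")
    .filter(pl.col("is_parsing_error").eq(False))
    .select("well_id", "key", "dataset_id", "source_id", "leaf_node")
    # keep one example row per (dataset, leaf node, source) combination
    .unique(subset=["dataset_id", "leaf_node", "source_id"])
    .collect(streaming=True)
)
df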
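To make the "previous segment" observation easier to quantify, here is a rough sketch of how one could record which segment of the key each well_id actually matches. It uses only the standard library and made-up keys; `segment_offset`, `toy_rows`, the `/` separator and the key layout are all assumptions for illustration, not the real index schema.

```python
from collections import Counter


def segment_offset(key, value, sep="/"):
    """Return the negative index of the key segment equal to `value`, or None."""
    segments = key.split(sep)
    # Walk from the end of the key so the offset is relative to the file name.
    for offset in range(1, len(segments) + 1):
        if segments[-offset] == value:
            return -offset
    return None


# Hypothetical (source_id, key, well_id) rows; real keys will look different.
toy_rows = [
    ("source_a", "data/source_a/plate1/A01/embedding.parquet", "A01"),
    ("source_a", "data/source_a/plate1/A02/embedding.parquet", "A02"),
    ("source_b", "data/source_b/plate2/B03/embedding.parquet", "plate2"),
]

# Count, per source, which segment position the parsed well_id came from.
offsets = Counter(
    (source, segment_offset(key, well_id)) for source, key, well_id in toy_rows
)
for (source, offset), count in sorted(offsets.items()):
    print(source, offset, count)
```

A source whose offset differs from the others (here the hypothetical `source_b`) would be one where the parser picks up the wrong segment.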

Source 15 of the JUMP dataset seems to have a different key structure from the other sources, which leads to a number of parsing errors that are not flagged in the is_parsing_error column of the index:

The dataset_id is 'jump' and not 'cpg0016-jump'

There are 460 unique 'plate_id' values from that source, but only 183 of those follow the expected structure.

# There are 460 distinct plate_ids for source 15 in JUMP. Are there really 460 plates?
# The plate_id values also vary in structure.
df = (index
  .filter(pl.col("dataset_id").eq("jump"))
  .filter(pl.col("source_id").eq("source_15"))
  .unique(subset=["plate_id"])
  .select(pl.col(["plate_id"]))
  .collect(streaming=True)
  )
df

# For source 15 in JUMP, are there really 460 plates?
# Only 183 distinct plate_ids match the regex for the expected plate name structure.
df = (index
  .filter(pl.col("dataset_id").eq("jump"))
  .filter(pl.col("source_id").eq("source_15"))
  .filter(pl.col("plate_id").str.contains("^PE(P|C)[0-9]{8}$"))
  .select(pl.col(["plate_id"]).unique().sort())
  .collect(streaming=True)
  )
df
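To look at the plate_ids that do not follow the expected structure, a small helper like the one below could split the distinct values into matching and non-matching groups. This is only a sketch on invented values: the regex is the one from the query above, while `split_by_pattern` and `toy_plate_ids` are hypothetical names, and the sample strings are not taken from the index.

```python
import re

# Same pattern as in the query above.
PLATE_PATTERN = re.compile(r"^PE(P|C)[0-9]{8}$")


def split_by_pattern(values, pattern=PLATE_PATTERN):
    """Return (matching, other) lists of the sorted distinct values."""
    unique = sorted(set(values))
    matching = [v for v in unique if pattern.search(v)]
    other = [v for v in unique if not pattern.search(v)]
    return matching, other


# Invented plate_id values, including a duplicate and two malformed entries.
toy_plate_ids = ["PEP12345678", "PEC00000001", "PEP12345678", "PEP1234", "embedding"]

matching, other = split_by_pattern(toy_plate_ids)
print(len(matching), "matching:", matching)
print(len(other), "not matching:", other)
```

Listing the non-matching group for source 15 should show which key segments are being picked up as plate_id by mistake.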

broadinstitute / cpg

Some parsing issues #34