facebookresearch / segment-anything

The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

SAM decoder model is returning the error "Cannot read property 'buffer' of undefined" when I try to load the SAM encoder image embeddings from a .txt file, using the ONNX React Native runtime. #726

Open Saikumar-gadde517 opened 6 months ago

Saikumar-gadde517 commented 6 months ago

I have the encoder image_embeddings in a text file in my root project directory. React Native is able to read the text file containing the encoder embeddings, but when I pass the data to the decoder model it doesn't even read the image_embeddings from the text file. It returns this error: Cannot read property 'buffer' of undefined.

Here is my code, please check it.

  const decoderModelPath = `${RNFS.TemporaryDirectoryPath}/vit_h_decoder.onnx`;

  console.log('Decoder model path is loading....');

  await RNFS.downloadFile({
    fromUrl: Image.resolveAssetSource(
      require('./models/vit_h_decoder.onnx'),
    ).uri,
    toFile: decoderModelPath,
  }).promise;

  console.log('Decoder model is started processing....');

  const decoderSession = await ort.InferenceSession.create(
    'file://' + decoderModelPath,
  );

  console.log('Decoder model is loaded....');

  const txtFile = `${RNFS.TemporaryDirectoryPath}/react-embeddings.txt`;

  await RNFS.downloadFile({
    fromUrl: Image.resolveAssetSource(require('./react-embeddings.txt'))
      .uri,
    toFile: txtFile,
  }).promise;

  console.log('Embeddings are loading...Please wait....');

  const fileEmbeddings = await FileSystem.readFile(txtFile);

  console.log('Embeddings are going to parse....');

  const parseDataFile = JSON.parse(fileEmbeddings);

  console.log(Object.keys(parseDataFile));
  console.log('Embeddings are Parsed successfully....');

  console.log('Feed is going to load....');

  const feed = {
    image_embeddings: parseDataFile,
    point_coords: new ort.Tensor(
      new Float32Array([10, 10, 0, 0]),
      [1, 2, 2],
    ),
    point_labels: new ort.Tensor(new Float32Array([0, -1]), [1, 2]),
    mask_input: new ort.Tensor(
      new Float32Array(256 * 256),
      [1, 1, 256, 256],
    ),
    has_mask_input: new ort.Tensor(new Float32Array([0]), [1]),
    orig_im_size: new ort.Tensor(new Float32Array([684, 1024]), [2]),
  };

  console.log('Feed is loaded...');
  const finalData = await decoderSession.run(feed);

  const filePath = `${RNFS.DocumentDirectoryPath}/example.txt`;

  await RNFS.writeFile(
    filePath,
    JSON.stringify(finalData.masks.data),
    'utf8',
  );

  await Share.open({
    title: 'Share file',
    url: `file://${filePath}`,
  });

  console.log('Done with the decoder model');
heyoeyo commented 6 months ago

One thing to check is whether the JSON.parse(fileEmbeddings) part is completing successfully (I'm not sure whether the error occurs during or after that step). It's possible that the embedding wasn't saved in a way that is JSON-compatible, and therefore can't be loaded/parsed properly.
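A quick way to check that (just a sketch, reusing the variable names from your code above) is to wrap the parse in a try/catch and log what actually comes back:

  let parseDataFile;
  try {
    parseDataFile = JSON.parse(fileEmbeddings);
    console.log(
      Array.isArray(parseDataFile)
        ? `Parsed array of length ${parseDataFile.length}`
        : `Parsed ${typeof parseDataFile} with keys: ${Object.keys(parseDataFile)}`,
    );
  } catch (err) {
    console.log('Embeddings file is not valid JSON:', err.message);
  }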

The other (related) suggestion would be to check the data type of the image embedding input. In the feed variable, every entry is an ort.Tensor, except the embedding, which is some json-compatible type (since it comes from JSON.parse(...)). It seems likely that the embedding needs to be formatted as a tensor as well, and that may be the (indirect) cause of the error message.
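For example, something like this might work (just a sketch; it assumes the .txt file holds a flat JSON array of floats and the usual SAM image embedding shape of [1, 256, 64, 64], so adjust to however the embeddings were actually exported):

  // Sketch only: assumes parseDataFile is a flat array of numbers and the
  // embedding shape is the standard SAM [1, 256, 64, 64].
  const embeddingValues = Float32Array.from(parseDataFile);
  const imageEmbeddingsTensor = new ort.Tensor(
    'float32',
    embeddingValues,
    [1, 256, 64, 64],
  );

  const feed = {
    image_embeddings: imageEmbeddingsTensor,
    // ...the other inputs as before
  };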

CriusFission commented 5 months ago

Hi @heyoeyo , I tried to do this in React Native. I created embeddings in a Python runtime, sent those embeddings to React Native, and ran the decoder with that embedding, and I got an output mask as well. But when I displayed the mask using a Python script, it looked very weird.

Original image: small_frame

Mask: image

After I ran the decoder, I saved the output locally as a JSON text file, read the masks key, and printed it out to get the above result. Am I missing something here? Do I have to do any further post-processing to get a binary mask? Any help is appreciated.

heyoeyo commented 5 months ago

Hi @CriusFission there may be a few things off here, but it's hard to say for sure.

The main thing that stands out as strange is the size of your mask: it looks to be something like 1024x700 pixels, whereas the input image is 480x640 (?). Following the SAM mask sizing is confusing because there are a bunch of steps, but I'll try to list them out...

  1. First the 'raw' mask comes out at a size of 256x256, regardless of your input image
  2. This then gets scaled up to 1024x1024
  3. After scaling, it gets cropped to undo the padding that gets added to the original transformed input (which would have a size of 768x1024 in this case)
  4. Then finally this cropped version gets scaled back to the original input image size (480x640).
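To make those sizes concrete, here's a rough sketch of the bookkeeping (assuming a 480x640 input and the standard 1024 longest-side resize):

  // Sketch of the expected sizes at each step (assuming a 480x640 input).
  const origH = 480, origW = 640;
  const scale = 1024 / Math.max(origH, origW);   // 1.6
  const resizedH = Math.round(origH * scale);    // 768
  const resizedW = Math.round(origW * scale);    // 1024
  // steps 1-2: 256x256 raw mask -> upsampled to 1024x1024
  // step 3:    crop away the padding -> resizedH x resizedW = 768x1024
  // step 4:    resize the crop back to origH x origW = 480x640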

One (or more) of these steps seems to have gone wrong here, because the mask isn't the right size and the padding is still visible. My best guess would be that some height/width values got swapped somewhere (step 3 most likely?), since the aspect ratio of the mask is flipped compared to the original input. If you're using the original SAM code, it's probably worth dropping a bunch of print(masks.shape) statements throughout the postprocess_masks function to try to see what's going on.

Aside from that, the mask image looks like the raw output from the model. The original SAM model doesn't output a binary mask directly, instead this comes from the last processing steps, where the mask gets thresholded (pixels > 0). That step is definitely missing here, so that's something to try adding in.
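As a sketch (assuming finalData.masks from the decoder output above, which holds float values rather than a binary mask), the thresholding could look something like:

  // Sketch: threshold the float mask values at 0 to get a binary mask.
  const maskData = finalData.masks.data;        // Float32Array of mask values
  const binaryMask = new Uint8Array(maskData.length);
  for (let i = 0; i < maskData.length; i++) {
    binaryMask[i] = maskData[i] > 0 ? 255 : 0;  // 255 = foreground, 0 = background
  }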

One last concern is that the mask is very low-contrast, which means the segmentation mask (even after thresholding) wouldn't have been anything meaningful (maybe the top-left corner...?). That sort of suggests that the input prompt may not be formatted correctly (unless you just put in a (0,0) point for testing, that would make sense), since nothing seems to be selected. So it's probably worth double checking that the input prompts are formatted/scaled correctly (if there was a swapping of width/height somewhere, that might mean the prompt was swapped as well which could've placed it outside the image).