facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Update pinned numpy in github action #3974

Open tarang-jain opened 1 month ago

tarang-jain commented 1 month ago

Pin the numpy version to <2 in the GitHub action.
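As a sketch of what the pin guards against, here is a stdlib-only check (hypothetical, not part of this PR) that a test suite could run to verify the environment satisfies the numpy<2 constraint:

```python
def satisfies_numpy_pin(version: str) -> bool:
    """Return True if a version string like '1.26.4' satisfies the numpy<2 pin."""
    major = int(version.split(".")[0])
    return major < 2

# An exact pin (numpy=1.26.4) also satisfies the range pin (numpy<2):
print(satisfies_numpy_pin("1.26.4"))  # True
print(satisfies_numpy_pin("2.0.1"))   # False
```

In practice the pin is enforced by the package manager at install time; a runtime check like this only helps CI fail fast with a clear message if the environment drifts.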

tarang-jain commented 1 month ago

@asadoughi I am surprised that the segfault in the RAFT builds appeared so suddenly. My guess was a numpy version mismatch: the conda envs for RAFT 24.06 require numpy<2, which is why I pinned numpy=1.26.4. Running valgrind on the torch tests, I see this:

...==1912667== Conditional jump or move depends on uninitialised value(s)
==1912667==    at 0x1270B55B: at::native::_to_copy(at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x1341BD55: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd___to_copy>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x12B12C28: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x13264C8B: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>), &at::(anonymous namespace)::_to_copy>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x12C23375: at::_ops::_to_copy::call(at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x1270969E: at::native::to(at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x135F2CF3: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd_dtype_to>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat> > >, at::Tensor (at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x12DA753C: at::_ops::to_dtype::call(at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat>) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x12087C84: at::TensorIteratorBase::compute_types(at::TensorIteratorConfig const&) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x120899F5: at::TensorIteratorBase::build(at::TensorIteratorConfig&) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x1208AC24: at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==    by 0x1239341E: at::meta::structured_add_Tensor::meta(at::Tensor const&, at::Tensor const&, c10::Scalar const&) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)
==1912667==  Uninitialised value was created by a stack allocation
==1912667==    at 0x12087320: at::TensorIteratorBase::compute_types(at::TensorIteratorConfig const&) (in /home/miniconda3/envs/faiss-main/lib/libtorch_cpu.so)

which makes me wonder whether downgrading torch might help. Please let me know if you have any suggestions. This exact same action was working earlier, right? If the GitHub action itself was unchanged, this points to a version-compatibility issue between some of the packages, since the action does not pin versions for any of them.
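Since the action does not pin versions, one low-effort mitigation is to have CI print the versions it actually resolved, so a future breakage can be matched against a version bump. A stdlib-only sketch (the helper name and the package list are assumptions, not code from the action):

```python
# Hypothetical CI helper: record the installed versions of otherwise-unpinned
# packages so the build log shows exactly what was tested.
from importlib import metadata

def report_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions

# Example package names; the action would list everything it installs.
print(report_versions(["numpy", "torch"]))
```

Running this as an early CI step makes a "same workflow, new failure" situation diagnosable from logs alone.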

asadoughi commented 1 month ago

We can look into pinning versions for all packages involved in the RAFT CI. Do you know which torch version is compatible with RAFT 24.06? More generally, is there a published compatibility matrix for each version of RAFT?