amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0

Why SNAP is not distributed naturally and does not support ADAM format directly? #25

Closed alartin closed 3 years ago

alartin commented 10 years ago

I'm wondering why SNAP does not use Hadoop MapReduce or Spark as its mapping engine, and why it does not support ADAM directly.

bolosky commented 10 years ago

We’re working on it. We’ve got support in SNAP to use stdin/stdout, and we have a Hadoop record reader for FASTQ (and paired interleaved FASTQ, which is about what it sounds like). We’ve also got some code to read the index from HDFS. I’m not quite sure when this will be ready to go into the beta, but probably not all that long. It’s all visible in the dev branch, but we make no representations about how well that works, since we change it all the time.

The ADAM support is planned, but we haven’t started coding it yet.

--Bill
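
For readers unfamiliar with the "paired interleaved FASTQ" layout mentioned above, here is a minimal Python sketch of how such a file can be split into read pairs. It is not SNAP's or the Hadoop record reader's actual code. It assumes plain four-line FASTQ records (no wrapped sequence lines) with the two mates of each pair stored back to back, and the names `FastqRecord`, `read_fastq`, and `read_interleaved_pairs` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumption: no line wrapping)."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield FastqRecord(header[1:], sequence, quality)


def read_interleaved_pairs(
    lines: Iterable[str],
) -> Iterator[Tuple[FastqRecord, FastqRecord]]:
    """Group consecutive records into (mate1, mate2) pairs."""
    records = read_fastq(lines)
    for first in records:
        second = next(records, None)
        if second is None:
            raise ValueError("odd number of records in interleaved FASTQ")
        yield first, second


# Made-up example data: one pair, mates named with the common /1 and /2 suffixes.
example = """@read1/1
ACGT
+
IIII
@read1/2
TGCA
+
HHHH
""".splitlines()

pairs = list(read_interleaved_pairs(example))
```

Because each pair occupies eight consecutive lines, a distributed reader only has to find a pair boundary to start reading in the middle of a file, which is presumably what makes the interleaved layout convenient for Hadoop-style input splits.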

From: XinWu [mailto:notifications@github.com] Sent: Sunday, March 16, 2014 8:26 PM To: amplab/snap Subject: [snap] Why SNAP is not distributed naturally and does not support ADAM format directly? (#25)

alartin commented 10 years ago

Hi Bill, sounds great! Actually, I think FASTQ is not a good format for NGS and distributed computing at all. Is there any plan to use the Parquet format, as ADAM does, to store reads, scores, devices, lanes, tiles, etc., so that Hadoop/Spark can read the compressed files directly from HDFS?

bolosky commented 10 years ago

We’ll eventually get ADAM support in, though I’m not quite sure when. I’m not particularly enamored with FASTQ, either, but it’s what’s coming off of sequencing machines now so we’ll continue to have to support it indefinitely.

--Bill

From: XinWu [mailto:notifications@github.com] Sent: Thursday, March 20, 2014 1:08 AM To: amplab/snap Cc: Bill Bolosky Subject: Re: [snap] Why SNAP is not distributed naturally and does not support ADAM format directly? (#25)

fnothaft commented 10 years ago

@alartin we have some work to support this at https://github.com/bigdatagenomics/avocado/tree/snap. This is a bit outdated, but I hope to have an update pushed soon.

gurvindersingh commented 9 years ago

Is there any update on writing ADAM files directly from SNAP? That would eliminate the extra step of converting SAM/BAM to ADAM later on.

bolosky commented 9 years ago

I’m not working on it. I don’t know how much (if any) progress people on the ADAM side have made.

From: Gurvinder Singh [mailto:notifications@github.com] Sent: Friday, May 1, 2015 11:50 AM To: amplab/snap Cc: Bill Bolosky Subject: Re: [snap] Why SNAP is not distributed naturally and does not support ADAM format directly? (#25)

gurvindersingh commented 9 years ago

Ok thanks Bill.

I have looked at the ADAM project; it seems to use SNAP in its avocado project to align data, via SNAP's stdin and stdout functionality. There is one issue, though: it seems that SNAP cannot read the reference index from HDFS, which forces the reference index to be located on all the Spark worker nodes, and that seems like a bit of overkill. It would be nice if SNAP could read the reference genome index from HDFS. This may already be supported and I am missing it; can you please let me know its status?

bolosky commented 9 years ago

Jeremy Elson did some work to make SNAP read directly from HDFS, so I think it can do it. I believe all you need to do is to prefix the filename with hdfs:, so you can do:

snap paired hdfs:/users/Bolosky/indices/hg19-20 hdfs:/users/Bolosky/reads/sample_reads.sam -o hdfs:/users/Bolosky/sample_snap_aligned.bam

I’m pretty sure that you can’t memory-map files from HDFS, so you’ll wind up reading and copying the entire index, which is somewhat painful.

I haven’t used it, so YMMV. It may be that the HDFS support only works on Windows, and I do think that it requires SNAP to be compiled with a flag set.

--Bill
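
As a rough illustration of the `hdfs:` prefix convention described above, the following Python sketch assembles the same command line from its parts. It only mirrors the command quoted in this comment and has not been verified against any SNAP build. The helper names `hdfs_path` and `build_snap_paired_command` are hypothetical, and the executable name `snap` is an assumption, since it may differ depending on how SNAP was built.

```python
from typing import List

HDFS_PREFIX = "hdfs:"


def hdfs_path(path: str) -> str:
    """Prefix a path with 'hdfs:' unless it already carries the prefix."""
    return path if path.startswith(HDFS_PREFIX) else HDFS_PREFIX + path


def build_snap_paired_command(
    index: str, reads: str, output: str, executable: str = "snap"
) -> List[str]:
    """Build the argument list for the paired-end command quoted in the thread.

    Assumption: only the 'paired' mode and the '-o' option shown above are used;
    whether a given SNAP binary accepts 'hdfs:' paths depends on how it was compiled.
    """
    return [executable, "paired", hdfs_path(index), hdfs_path(reads),
            "-o", hdfs_path(output)]


command = build_snap_paired_command(
    "/users/Bolosky/indices/hg19-20",
    "/users/Bolosky/reads/sample_reads.sam",
    "/users/Bolosky/sample_snap_aligned.bam",
)
print(" ".join(command))
```

The list form can be handed to `subprocess.run` unchanged, which avoids shell quoting problems if paths contain spaces.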

From: Gurvinder Singh [mailto:notifications@github.com] Sent: Saturday, May 2, 2015 11:29 AM To: amplab/snap Cc: Bill Bolosky Subject: Re: [snap] Why SNAP is not distributed naturally and does not support ADAM format directly? (#25)

gurvindersingh commented 9 years ago

I will test it out. Do you know which flag, off the top of your head?