Closed alartin closed 3 years ago
We’re working on it. We’ve got support in SNAP to use stdin/stdout, and we have a Hadoop record reader for FASTQ (and paired interleaved FASTQ, which is about what it sounds like). We’ve also got some code to read the index from HDFS. I’m not quite sure when this will be ready to go into the beta, but probably not all that long. It’s all visible in the dev branch, but we make no representations about how well that works, since we change it all the time.
The ADAM support is planned, but we haven’t started coding it yet.
--Bill
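For readers unfamiliar with the term, "paired interleaved FASTQ" can be illustrated with a minimal sketch. It assumes the usual convention of four-line FASTQ records with the two mates of each pair written back to back. The helper names `fastq_records` and `interleave` are hypothetical and are not taken from SNAP or its Hadoop record reader.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines (header, bases, '+', qualities)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(mate1: Iterable[str], mate2: Iterable[str]) -> Iterator[str]:
    """Alternate records from the two mate inputs: pair 1 mate 1, pair 1 mate 2, ...

    zip() stops at the shorter input, so this sketch silently ignores
    unpaired trailing records; a real tool would likely report them.
    """
    for rec1, rec2 in zip(fastq_records(mate1), fastq_records(mate2)):
        yield from rec1
        yield from rec2
```

Because both mates of a pair are adjacent in the output, a single stream (stdin, for instance) can carry paired-end reads without a second file.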
From: XinWu [mailto:notifications@github.com] Sent: Sunday, March 16, 2014 8:26 PM To: amplab/snap Subject: [snap] Why SNAP is not distributed naturally and does not support ADAM format directly? (#25)
I just wonder why snap does not use mapreduce of hadoop/spark as the mapping engine and why it does not support ADAM directly.
Hi Bill, Sounds great! Actually I think FASTQ is not a good format for NGS and distributed computing at all. Is there any plan to use Parquet format like ADAM does to store reads, scores, devices, lanes, tiles, etc? So hadoop/spark can read the compressed files directly from HDFS.
We’ll eventually get ADAM support in, though I’m not quite sure when. I’m not particularly enamored with FASTQ, either, but it’s what’s coming off of sequencing machines now so we’ll continue to have to support it indefinitely.
--Bill
From: XinWu [mailto:notifications@github.com] Sent: Thursday, March 20, 2014 1:08 AM To: amplab/snap Cc: Bill Bolosky Subject: Re: [snap] Why SNAP is not distributed naturally and does not support ADAM format directly? (#25)
@alartin we have some work to support this at https://github.com/bigdatagenomics/avocado/tree/snap. This is a bit outdated, but I hope to have an update pushed soon.
Any update on writing ADAM files directly from SNAP? This would eliminate the extra step required to convert SAM/BAM to ADAM later on.
I’m not working on it. I don’t know how much (if any) progress people on the ADAM side have made.
From: Gurvinder Singh [mailto:notifications@github.com] Sent: Friday, May 1, 2015 11:50 AM To: amplab/snap Cc: Bill Bolosky Subject: Re: [snap] Why SNAP is not distributed naturally and does not support ADAM format directly? (#25)
Ok thanks Bill.
I have looked at the ADAM project; it seems to be using SNAP in their avocado project to align data. It uses SNAP's stdin and stdout functionality. There is one issue, though: it seems that SNAP cannot read the reference index from HDFS, so it forces the reference index to be located on all the Spark worker nodes, which seems a bit of overkill. It would be nice if SNAP could read the reference genome index from HDFS. It might be there and I am missing it; can you please let me know its status?
Jeremy Elson did some work to make SNAP read directly from HDFS, so I think it can do it. I believe all you need to do is to prefix the filename with hdfs:, so you can do:
Snap paired hdfs:/users/Bolosky/indices/hg19-20 hdfs:/users/Bolosky/reads/sample_reads.sam -o hdfs:/users/Bolosky/sample_snap_aligned.bam
I’m pretty sure that you can’t map files from HDFS, so you’ll wind up reading and copying the entire index, which is somewhat painful.
I haven’t used it, so YMMV. It may be that the HDFS support only works on Windows, and I do think that it requires SNAP to be compiled with a flag set.
--Bill
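For anyone scripting this, the command can be assembled programmatically. The following is only a sketch under the assumptions stated in the message: that the build in use has HDFS support compiled in and accepts the `hdfs:` prefix. The helper names `hdfs_path` and `build_snap_command` are hypothetical, and the executable name `snap` is an assumption that may differ between builds.

```python
from typing import List


def hdfs_path(path: str) -> str:
    """Prefix an absolute path with 'hdfs:' as described in the thread."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got %r" % path)
    return "hdfs:" + path


def build_snap_command(index_dir: str, reads: str, output: str) -> List[str]:
    """Build the argument list for a paired-end run with all paths on HDFS.

    Only the arguments that appear in the thread are used:
    'paired <index> <reads> -o <output>'. The executable name is assumed.
    """
    return [
        "snap",
        "paired",
        hdfs_path(index_dir),
        hdfs_path(reads),
        "-o",
        hdfs_path(output),
    ]
```

The resulting list could be passed to `subprocess.run`; whether the run succeeds depends on the HDFS support being present, which is untested here.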
From: Gurvinder Singh [mailto:notifications@github.com] Sent: Saturday, May 2, 2015 11:29 AM To: amplab/snap Cc: Bill Bolosky Subject: Re: [snap] Why SNAP is not distributed naturally and does not support ADAM format directly? (#25)
I will test it out. Do you know which flag, off the top of your head?